ABSTRACT

Directors of educational projects need to be aware of
the consequences their decisions may have for evaluation and
appreciate the need for working closely with their evaluators from
the earliest planning stage. Attempting to address the needs of
project directors and evaluators, this guidebook deals with one
central aspect of project evaluation: measuring cognitive
achievement gains. Its purpose is to provide the tools needed to
conduct technically sound, interpretable evaluation studies. It
covers the entire evaluation process from the administrative
decisions in selecting an evaluation design to the details of 
collecting, analyzing, and reporting the data. After the 
introduction, Chapter 2 describes 12 hazards commonly encountered in 
evaluations which may invalidate otherwise sound studies. The hazards 
are discussed, and ways to avoid each hazard are outlined. Chapter 3
presents a procedural guide, in decision-tree form, for selecting a 
suitable evaluation model given a particular set of constraints. 
Chapter 4 presents the five evaluation models referred to in Chapter 
3. Each model is summarized describing its characteristics, 
strengths, weaknesses, and considerations relating to its 
implementation. Chapter 5 deals with the details of data collection 
and Chapter 6 with summarizing and reporting of impact data. 
Appendices contain characteristics of some commonly used standardized
tests and analysis of covariance worksheets. (RC)
I. INTRODUCTION



Purpose and Scope 

The evaluation of any special instructional project is affected by 
decisions made at all levels of project administration and at all stages 
of planning and implementation. All too often an evaluation specialist 
is brought in after a project is well under way only to find that actions 
have already been taken which make it difficult, if not impossible, to 
perform any kind of meaningful impact assessment. 

To avoid this clearly undesirable situation, directors of educational 
projects need to be aware of the consequences their decisions may have 
for evaluation and to appreciate the need for working closely with their 
evaluators from the earliest planning stage. This guidebook attempts
to address the needs of project directors as well as evaluators, and the 
next section of this chapter specifically designates certain sections 
as "recommended reading" for project directors. 

The guidebook deals with only one central aspect of project evalua- 
tion, measuring cognitive achievement gains. It is not concerned with 
project costs or with any affective benefits which project participants
may accrue. Neither does it address any such "process" variables as 
how well the objectives of the project were stated, how well the needs 
of the children were assessed, or how closely teachers followed prescribed 
instructional strategies. The entire focus is on obtaining as clear and 
unambiguous an answer as possible to the question, "How much more did pupils 
learn by participating in the project than they would have learned with- 
out it?" 

The guidebook is the result of a search by the authors for effective 
compensatory reading and mathematics projects (Tallmadge, 1974). The 
search encompassed some 2,000 projects, all of which had received some 
form of "official" recognition for success. Of the 2,000, only six 
could be found which, under close scrutiny, were able to meet the selec-
tion criteria of effectiveness, cost, availability, and replicability
established for this search (see Foat, 1974). Most discouraging, however,
was the fact that not one of the evaluations provided acceptable evidence 
regarding project success or failure. In all cases, problems in conducting 
and reporting the evaluations rendered the results inconclusive. Obviously, 
practical considerations prevent school evaluators from doing controlled,
laboratory experiments, but many of the problems in current evaluation 
practices could be avoided with little or no increase in cost or effort. 
The rigor of laboratory experimentation may be beyond reach, but the state- 
of-the-art can be greatly improved without placing unrealistic demands 
on schools or evaluation resources. 

The purpose of this guidebook is to provide those concerned with 
project evaluation with the basic tools they need to conduct technically 
sound, interpretable evaluation studies. Every effort has been made to 
minimize the amount of technical sophistication required of users of the 
guidebook. It deliberately avoids exotic designs and focuses instead on
five basic models which appear feasible to implement in real-world settings.
Despite this orientation, it must be acknowledged that evaluation is not,
and cannot be made, simple. Particularly where situational constraints
force adoption of statistical rather than experimental controls for 
extraneous influences, theoretical and computational complexities multiply 
at an astonishing rate. 

It seems likely that some potential users of the guidebook will find 
certain sections overly technical. On the other hand, those readers who 
can follow the more difficult portions may find much that seems trivial 
or unnecessary. Perhaps the best that can be hoped is that a reasonable 
compromise has been found between the inherent complexity of the total 
evaluation problem and the need to accomplish meaningful assessments with- 
out placing unreasonable demands on the technical expertise of the evaluator. 

Organization and Content

The guidebook covers the entire evaluation process from the admin-
istrative decisions in selecting an evaluation design to the details of
collecting, analyzing, and reporting the data. Many of the details will
be of interest primarily to the evaluation specialist and the project
director may skip those sections without detriment. The following para-
graphs summarize the topics and indicate the audience for whom each was
intended.

The final sections of this chapter describe Evaluation Basics and
Preliminary Planning. These sections are quite brief and should be read
by project directors as well as those concerned with the details of pro- 
ject evaluation. 

Chapter II describes 12 hazards which are commonly encountered in 
educational evaluation and which may completely invalidate otherwise 
sound studies. Each of the hazards is named and then described very 
briefly. Material is then presented discussing why the hazard may in- 
validate impact assessment. Finally, there is a section on how the 
hazard can be avoided. 

The 12 presentations are not lengthy and should be read by both pro-
ject directors and evaluators. As a minimum, project directors should
read the summary statement of each hazard in order to recognize the
practices and to realize that they must be avoided if a valid evaluation
is to be done.

Chapter III presents a procedural guide for selecting a suitable 
evaluation model given the particular set of constraints faced by the 
project director and evaluator. The entire procedure is presented in 
decision-tree form (see Figure 1, p. 47) with each decision point rep- 
resented by a question followed by a choice of two alternatives (e.g., 
Is a comparison group evaluation design feasible?). Each question is 
discussed on separate pages which describe the implications of the de- 
cisions and the alternative courses of action available to the evaluator. 

It is strongly recommended that the project director as well as the
evaluator read Chapter III. Portions of some of the discussion sections 
become quite technical and may be skipped by the project director, but 
it is important that he be familiar with the evaluation options open to
him and with the consequences of the decisions he must make.

Chapter IV presents the five evaluation models referred to in
Chapter III. There is a brief summary of each model which describes
its general characteristics, its strengths, its weaknesses, and consider-
ations relating to its implementation. The summaries should be carefully
read by the project director.

Each summary is followed by several pages of step-by-step instruc-
tions for implementing the model, except in one instance where the compu-
tational procedures were judged too complex for inclusion.

The sections on implementation are intended for use by evaluation 
specialists and are somewhat technical. It is assumed that the reader 
will have had at least one college-level course in elementary statistics 
and will be familiar with and able to compute means, standard deviations, 
and correlations. No further expertise should be required to follow the 
implementation procedures, although the underlying concepts and rationales 
may not always be understood. Consultation with a statistician is ad- 
visable for evaluators not familiar with the concepts of covariance and 
regression if models employing these statistical procedures are selected. 

The format of the guidebook is such that the design selection pro- 
cedure in Chapter III (decision-tree) will automatically lead the reader 
to only one of the five models described in Chapter IV. He would thus 
not need to read any of the other model descriptions. Preliminary ex- 
perience with these chapters, however, suggests that they are interactive, 
and that reading about the alternative models — particularly the sections 
dealing with the considerations relevant to implementation — will often 
lead to a rethinking of the decision made in the design selection pro- 
cedure. For this reason, at least a superficial reading of all of the 
model descriptions is recommended before a final model selection is made. 

Chapter V deals with the details of data collection and Chapter VI 
with summarizing and reporting of impact data. These chapters need not 
be of great concern to project directors although a cursory review of what
they contain might facilitate understanding and communication with the 
project evaluators. 

Several appendices are also provided which expand upon issues 
raised in the body of the guidebook. These appendices, of course, are 
intended primarily for evaluation specialists and need not concern project 
directors. 

Evaluation Basics

To find out whether students do better in a special project than they 
would have done without it, the evaluator needs two things: a good measure 
of how the students performed after their project participation, and an 
accurate estimate of how they would have done without the project. The
difference between these measures provides an index of the project's im- 
pact. In order to get a good measure of how students performed, the 
evaluator must select an appropriate test and ensure that it is adminis- 
tered and scored correctly. Often, the catalog of available tests will 
not include one with exactly the characteristics desired for assessing a 
particular project. However, most standardized reading and math tests 
are sensitive to any significant cognitive growth and should usually 
prove adequate for assessing the impact of special treatments. Objective- 
referenced or criterion-referenced tests are also suitable assuming that 
they have been carefully constructed. Tests and testing are discussed 
further in Chapter V of this guidebook and in Appendix A. 

A more difficult problem lies in estimating how students would have 
done without the project. In university laboratory studies, the experiences 
of randomly selected comparison groups are controlled so as to be identical 
to those of the experimental group in all respects except for the variable 
of interest. This approach is rarely a viable option in school projects. 
A variety of substitute approaches are commonly used but all are in varying 
degrees less satisfactory. The worst of these alternatives are included 
in Chapter II as "hazards" and make evaluations meaningless. The best 
are included in Chapter IV with recommendations on when they should be
used and explanations of their strengths and limitations.

Chapters V and VI also suggest, as mentioned above, ways of analy-
zing, interpreting, and reporting results. Details of recommended procedures
are included there while characteristics of some widely used commercial
reading and mathematical achievement tests are included in Appendix A.

Preliminary Planning

Ideally, the planning of an evaluation should proceed concurrently
with the planning of the project to be evaluated. Obviously, this is
not always possible, but it is important to be aware of the fact that
some project decisions have important implications for evaluation, and vice- 
versa. Project-related decisions may, in fact, preclude the possibility 
of conducting any meaningful kind of evaluation. 

One area where close coordination between project and evaluation 
planning is absolutely essential is that of selecting project participants. 
Several possibilities exist: 

(a) all children comprising a particular group (e.g., all third
graders) may be given a special supplementary project,

(b) participants may be randomly selected from an identifiable
group or population, or

(c) eligibility for participation may depend on the special
needs of some members of a larger group (e.g., disadvantaged,
gifted).

Each of these alternative participant selection plans fits one or more
of the models presented later in this guidebook, but is incompatible with,
or places special restrictions on, others.

A second area where coordinated planning is required is the matching 
of evaluation models with test instruments. Criterion-referenced tests
can be used with all but the norm-referenced model, which requires stan-
dardized tests. The norm-referenced model not only requires that stan-
dardized tests be used but that the same level of the same test be used
for both pre- and posttesting and that the testing be accomplished at
exactly prescribed times during the year.

When a project director makes an "executive" decision to use a
specific type of test or some particular method of selecting participants,
he severely limits the number of evaluation models which can be used and
may substantially reduce the conclusiveness of his assessment as well. 
The assumption made in this guidebook is that his first concern will be 
for conclusive findings. Accordingly, he will wish to consider the 
feasibility, practicality, and limitations of the more scientifically
sound evaluation models before restricting his choices through hasty de-
cisions about tests or participant selection procedures. In accordance
with this orientation, the model selection procedure illustrated in Figure 
1 presents the models arranged in order of decreasing rigor from the top 
to the bottom of the figure so that the evaluation planner can see the 
consequences of each of his decisions. 

Once a model is selected through the decision-tree process, the 
evaluation planner can read about its strengths and weaknesses and about 
the conditions and restrictions associated with its use. Careful study 
of the remaining four models may suggest alternatives that appear more 
desirable. At that time, he might decide to reject his first choice, 
re-enter the decision tree, and select another model. 

The decision points in the model selection procedure all relate 
to the manner in which no-treatment, posttest performance expectations 
are generated. Even where the most rigorous model is selected, however, 
there are many possibilities for implementation errors which could inval-
idate the entire evaluation. The next section of this guidebook des-
cribes twelve of the most commonly encountered hazards, their consequences,
and what should be done to avoid them. These common hazards should be 
studied carefully before any evaluation is undertaken. 



II. COMMON HAZARDS IN EVALUATION

This section describes twelve common hazards in evaluation, the
problems they create, and the ways in which the problems can be avoided.
The occurrence of any one of the twelve may completely invalidate an other-
wise sound evaluation. The hazards include the following:

1. The use of grade-equivalent scores.

2. The use of gain scores.

3. The use of norm-group comparisons with inappropriate test dates.

4. The use of inappropriate levels of tests.

5. The lack of pre- and posttest scores for each project par-
ticipant.

6. The use of non-comparable treatment and comparison groups.

7. The selection of project participants based on pretest scores.

8. The assembling of a matched comparison group after the project
participants are selected.

9. The careless administration or scoring of tests.

10. The assumption that an achievement gain is due to the treat-
ment when, in reality, it is due to some other factor.

11. The use of non-comparable pretests and posttests.

12. The use of inappropriate formulas to estimate posttest scores.

Although subsequent sections of this guidebook refer back to spe-
cific hazards, it is strongly recommended that the reader study all of
the hazards before going on to other material.






Hazard 1 



The use of grade-equivalent scores.



Grade-equivalent scores provide an insensitive, and in some instances sys-
tematically distorted, assessment of a project's impact.



Why is this a hazard?

There are three serious problems with grade-equivalent scores: 

A. The concept of a "grade-equivalent" score is misleading. For
example, a grade-equivalent score of seven attained by a fifth grader on
a math test does not mean that he knows sixth- and seventh-grade math. It
is more accurate to say that he can do fifth-grade math as well as an
average seventh grader can do fifth-grade math although even this repre- 
sentation is not strictly accurate. It is quite possible, in fact, that 
when the test was normed no seventh graders ever took the level of the 
test intended for use in the fifth grade. In such cases, the seventh- 
grade grade-equivalent scores reported in the test manual are simply sta- 
tistical projections and tell us little about how seventh graders would 
have actually scored if they had taken the fifth-grade test. 

B. Grade-equivalent scores do not comprise an equal-interval scale.
That is, a grade-equivalent score of two is not in any sense "half" of
a score of four. For this reason, "average" grade-equivalent scores are
not consistent with averages computed from more appropriate kinds of
scores and are not interpretable.

C. The normative data for many commercial tests are collected
during one short interval of the school year, often in February or March.
In order to establish norms for fall and spring, a smooth curve is drawn
connecting the points which represent actual data. Unfortunately, there
is substantial evidence that learning does not proceed uniformly over
the calendar year. One factor that contributes to the irregular learning
pattern is the effect of reduced gains or even forgetting during the sum-
mer months. As a result, the current procedures used to generate grade-
equivalent scores tend to make them systematically too low in the fall and
too high in the spring. Fall-to-spring gains thus appear to make a pro-
ject look unusually effective when actually the gains were exactly the
same as would be expected under normal classroom conditions. Even where
using grade-equivalent scores does not introduce systematic biases (as
would be the case if there were a 12-month pre-to-posttest interval), the
curve-fitting procedures used to generate such scores introduce errors
large enough to invalidate any evaluation.

How can the hazard be avoided?

There is never any technically sound reason for using grade-equivalent 
scores in evaluating projects and they should be avoided. Standardized 
tests can still be administered. However, raw scores should be converted 
to standard scores instead of grade-equivalent scores before summary 
statistics are computed. Mean pretest and posttest standard scores should 
then be converted to their percentile equivalents. Finally, pre-to-post- 
test gains can be compared against expectations derived from the national 
norms, but only if the tests were administered at appropriate times (see 
Hazard 3). Further discussions of the problems created by using grade- 
equivalent scores for evaluation purposes can be found in Tallmadge and 
Horst (1974, Appendix E).
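
The conversion sequence recommended for avoiding this hazard (raw scores to standard scores, then means, then percentile equivalents) can be sketched in Python. This is a minimal illustration only. The two lookup tables are invented for the example and do not come from any real test manual, and the function name `mean_percentile` is hypothetical. An actual evaluation would use the publisher's own conversion tables.

```python
from statistics import mean

# Hypothetical conversion tables (assumed values, NOT from any real test manual).
RAW_TO_STANDARD = {18: 40, 22: 45, 26: 50, 30: 55, 34: 60}
STANDARD_TO_PERCENTILE = {40: 16, 45: 31, 50: 50, 55: 69, 60: 84}


def mean_percentile(raw_scores):
    """Convert raw scores to standard scores, average the standard scores,
    then look up the percentile equivalent of that mean."""
    standard_scores = [RAW_TO_STANDARD[raw] for raw in raw_scores]
    mean_standard = mean(standard_scores)
    # Use the tabled standard score closest to the computed mean.
    nearest = min(STANDARD_TO_PERCENTILE, key=lambda s: abs(s - mean_standard))
    return mean_standard, STANDARD_TO_PERCENTILE[nearest]


pretest_raw = [18, 22, 26]   # made-up raw scores for three pupils
posttest_raw = [22, 26, 30]  # the same three pupils at posttest

print(mean_percentile(pretest_raw))
print(mean_percentile(posttest_raw))
```

Note that the averaging is done on standard scores, and the percentile is looked up only once for the mean. Grade-equivalent scores never enter the calculation.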
Hazard 2



The use of gain scores. 



There are several kinds of gain scores and they are generally used in an
attempt to adjust for initial differences between treatment and comparison
groups in conventional experimental designs. Where such differences are
real, they cannot be adequately adjusted. Where they are the result of
random sampling fluctuations, "raw" gain scores overcorrect for between-
group differences and "residual" gain scores are likely to undercorrect.



Why is this a hazard?

The most commonly encountered type of gain score is the "raw" gain 
score which is simply the posttest score minus the pretest score. The 
term "raw" does not refer to the type of pretest or posttest scores (raw, 
standard, percentile, etc.) used to determine the gain but to the gain
itself. The category of raw gain scores thus includes grade-equivalent
gains.

If differences between treatment and comparison groups are random, 
(i.e., the two groups may be regarded as random samples from a single pop- 
ulation) then raw gain scores overcorrect for pretest differences by 
excessively inflating the posttest performance measure of the initially 
inferior group. Analysis of covariance provides an appropriate means 
of adjusting for random pre-treatment differences between groups. 

In certain theoretical situations where differences between treat- 
ment and comparison groups are real (i.e., the groups are samples from 
different populations) gain scores may represent the best method for 
equating the groups. In real-world educational evaluations, however,
factors such as differential growth rates, different test score relia-
bilities as a function of achievement levels, different reliabilities on
pre and post measures, and test floor effects all work to invalidate
this type of adjustment (Campbell, 1974). The authors reject statistical
techniques for equating truly non-comparable groups in conventional ex-
perimental designs. Such groups only permit defensible conclusions of
effectiveness in those rare instances when an initially inferior treat-
ment group outperforms the initially superior comparison group on the
posttest.

A "residual" gain score is not a gain score at all. It is the
difference between an actual posttest score and an estimated posttest
score where the estimate has been derived from the pretest scores using
regression techniques. Whenever there is a pretest difference between
treatment and comparison group means, residual gain scores systematically
undercorrect. The amount of undercorrection is directly proportional to
the size of the between-group difference.

Tallmadge and Horst (1974, pp. 36-37) present a further discussion
of gain score problems.

How can the hazard be avoided?

Gain scores should never be used. Where pretest scores are equal 
for treatment and comparison groups, there is, of course, no need for the 
kind of adjustment gain scores are supposed to provide. Where between-
group differences result from random sampling fluctuations, covariance
analysis is the appropriate technique to use. Where the differences are
real, and the groups are truly non-comparable, there is no adequate tech-
nique for equating them and conventional comparison group evaluation models
should not be used. Appropriate alternative models are recommended in 
Chapter III. 
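
The contrast between raw gain scores and covariance adjustment can be illustrated with a small calculation. The Python sketch below uses invented scores in which every pupil in both groups falls on the same pretest-to-posttest line, so there is no treatment effect by construction. The function names and the data are assumptions made for illustration, and the adjustment shown is only the adjusted-means step of an analysis of covariance (using the pooled within-group slope), not a complete analysis with significance tests.

```python
from statistics import mean


def pooled_within_slope(groups):
    """Pooled within-group regression slope of posttest on pretest."""
    sum_xy = 0.0
    sum_xx = 0.0
    for pre, post in groups:
        pre_mean, post_mean = mean(pre), mean(post)
        sum_xy += sum((x - pre_mean) * (y - post_mean) for x, y in zip(pre, post))
        sum_xx += sum((x - pre_mean) ** 2 for x in pre)
    return sum_xy / sum_xx


def raw_gain_difference(treat, comp):
    """Treatment mean raw gain minus comparison mean raw gain."""
    (t_pre, t_post), (c_pre, c_post) = treat, comp
    return (mean(t_post) - mean(t_pre)) - (mean(c_post) - mean(c_pre))


def covariance_adjusted_difference(treat, comp):
    """Posttest mean difference adjusted for the pretest mean difference."""
    (t_pre, t_post), (c_pre, c_post) = treat, comp
    slope = pooled_within_slope([treat, comp])
    return (mean(t_post) - mean(c_post)) - slope * (mean(t_pre) - mean(c_pre))


# Made-up data: every pupil satisfies posttest = 11 + 0.5 * pretest,
# so neither group was helped or harmed by any treatment.
treatment = ([10, 20, 30], [16, 21, 26])    # (pretest scores, posttest scores)
comparison = ([14, 24, 34], [18, 23, 28])

print(raw_gain_difference(treatment, comparison))
print(covariance_adjusted_difference(treatment, comparison))
```

With these numbers the raw gain difference is 2 points in favor of the initially lower-scoring treatment group even though no effect was built in, while the covariance-adjusted difference is 0.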
Hazard 3



The use of norm-group comparisons with inappropriate test dates. 



Administration of tests on dates which do not correspond to the date when
the actual normative data were collected invalidates norm-referenced com-
parisons.



Why is this a hazard?

When comparison groups are available, few evaluators would even con-
sider testing treatment and comparison students more than a few days apart.
When a norm group is used for comparison, this issue appears to be given
little thought. The problem stems from two misleading practices followed
by test publishers. First, interpolation or extrapolation processes 
are used to "create" norms for periods when no "real" normative data
were collected. Thus, most publishers provide norms for fall, winter, and
spring even though data were collected at only one or possibly two of
these points. Projected norms are generally based on the assumption of
linear cognitive growth over each month of the nine months of the school
year with one additional month's gain over the three summer months. There 
is no evidence to support this assumption and the created norms are likely 
to be far enough off to distort the impact of special instructional 
projects. Norms based on projected estimates should never be used for 
evaluation purposes. 

The second practice is the suggestion, implicit in most norms tables,
that the norms are valid over a three- or even a four-month period. For 
this to be the case, children would have to learn nothing over the entire 
period, then show a large gain overnight at the end of the period and so 
on. This assumption is clearly absurd. If the norms are correct some-
where in the middle of the time period, they will be systematically too
low at the beginning and systematically too high at the end of the period.
The errors involved are quite large and can give a severely distorted
picture of project impact. (See Tallmadge & Horst, 1974, Appendix D,
p. 67.)

How can the hazard be avoided?

It is absolutely essential to test children in the treatment con- 
dition within a week or so of the dates on which the norm groups were 
tested. Tests which provide normative data for only one point in the 
year should not be used for norm-referenced evaluation of fall-to-spring
gains. Instead it is better to select a test with normative data in both
fall and spring even though the choice of tests is then limited. Basically, 
it is never advisable to extrapolate or interpolate very far from observed 
normative data. 
Hazard 4



The use of inappropriate levels of tests. 



If most of the pupils tested are getting nearly all, or hardly any, of the
test items correct, the level of the test is inappropriate for assessing
their cognitive achievement status. Measurement under these conditions
is both unreliable and invalid. Ideally, the pupils tested should score
in the middle of the range of possible raw scores.



Why is this a hazard?

The major standardized achievement tests are divided into several 
levels which cover different grades or grade bands. Each level is an 
individual test appropriate for only two or three grades. In the case
of projects aimed at slow or fast learners, the test level nominally 
designated for their grade is likely to be too difficult (pupils will 
encounter the test "floor") or too easy (pupils will encounter the test
"ceiling") and would not provide a reliable and valid measure of achieve- 
ment. Ceiling and floor effects may cause similar distortions in evalu- 
ations using criterion-referenced tests. 

How can the hazard be avoided?

Test levels should be selected on the basis of the achievement 
levels of the pupils, not on the basis of their grade in school. Usually, 
one level above or below that nominally recommended for a particular 
grade will be sufficient to avoid ceiling and floor effects, but no firm 
recommendation can be made as difficulty levels and ranges of coverage
vary greatly from instrument to instrument. 
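
A rough screening of a candidate test level against tryout scores might look like the Python sketch below. It is offered only as an illustration. The cutoffs of 25 and 75 percent of items correct are arbitrary assumptions standing in for "hardly any" and "nearly all", the scores are invented, and the function name `level_check` is hypothetical. For multiple-choice tests the lower cutoff would also need to reflect the score expected from guessing.

```python
def level_check(raw_scores, max_score, low=0.25, high=0.75):
    """Classify a test level by the mean proportion of items answered correctly.

    The `low` and `high` cutoffs are assumed values, not recommendations
    taken from the guidebook.
    """
    mean_proportion = sum(raw_scores) / (len(raw_scores) * max_score)
    if mean_proportion < low:
        return "too difficult (floor risk)"
    if mean_proportion > high:
        return "too easy (ceiling risk)"
    return "appropriate"


# Invented tryout scores on a 40-item test for three different groups.
print(level_check([36, 38, 39, 40], max_score=40))
print(level_check([18, 20, 22, 24], max_score=40))
print(level_check([4, 6, 8, 6], max_score=40))
```

A mean near the middle of the raw score range is the target. A group flagged at either extreme would be retested with the next lower or higher level.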

Using test levels other than those nominally recommended for 
particular grade levels is likely to mean that norms tables for the grades
tested are not included in the test manuals. This is unfortunate since
it is clearly not meaningful to assess either status or growth through
comparisons with children at a different grade level. The status of a 
sixth grader should be assessed using sixth-grade norms even if he is 
tested with a fourth-grade test. If a comparison group is available, 
there is no problem because growth is assessed with reference to the com-
parison pupils, not to the norms. With norm-referenced evaluation models,
however, there may be a problem. Fortunately, most major test publishers 
have interlocked their test levels by providing overlapping grade-level 
coverage. This practice has enabled the development of score equivalen-
cies between adjacent test levels so that it is possible to predict quite
accurately from a pupil's score on one test level how he would have scored
on the next higher or lower level. 

From the between-level score equivalencies, it is common practice
to develop a single score scale spanning all test levels so that raw 
scores from any level can be converted to scores on this scale (scores 
of this type are often called scale, standard, or expanded standard scores). 
Scale scores can be referenced to any set of normative data. Thus, scores 
of sixth graders tested with a fourth-grade test can be converted to sixth- 
grade percentiles and it is not necessary to use a test which is likely 
to be too easy or too difficult for the particular children being tested. 
While there are generally some measurement errors which result from im- 
perfect interlocking, typically they are smaller than those which result 
from encountering test ceilings or floors. 

Whatever level of a test is selected for use, that same test level 
should be used for both pre- and posttesting (see Hazard 10). 
Hazard 5



The lack of pre- and posttest scores for each treatment participant. 



Analyses of project impact should be based only on those participants with 
both pre- and posttest scores. Interpretation of these data, however, 
should take into consideration the characteristics of pupils who dropped 
out, entered late, or graduated from the project. 



Why is this a hazard?

In most projects, the group that is ultimately posttested is not 
composed of exactly the same students as the pretest group due to drop- 
outs and new students during the school year. Therefore, pre- and post- 
test mean scores are not strictly comparable. In particular, it often 
seems that the dropouts from a special program are among the slowest 
students. Eliminating their low scores from the posttest may raise 
the mean posttest score considerably. On the other hand, some projects 
may return successful students to their regular classrooms, thus lowering 
the mean posttest score for the remaining group. It is not uncommon to 
find evaluation reports which include posttest scores for fewer than half 
of the reported project participants and any conclusions in such reports
are usually meaningless. 

How can the hazard be avoided?

It is not possible to prevent students from dropping out of or
entering a project after it has begun. Still, it is essential to base 
any conclusions about the impact of the project on the data from students 
who have both pre- and posttest scores, but even this is not enough. The 
pretest score distribution for all dropouts must be examined to see if 
it differs from that of the non-dropouts. Further, if the number of dropouts 
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is large, at least a brief investigation of the reasons for dropping out 
is required. Sometimes a project is targeted at certain children, and 
the dropouts may be either students who succeeded and returned to their 
regular classes leaving the unsuccessful students to be posttested, or 
they may be exactly the students for whom the program was intended, but 
who failed and left. 
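
The attrition check described above can be sketched in a few lines. This is 
only an illustration, assuming each pupil's record is a (pretest, posttest) 
pair in which a missing score is recorded as None; the function name and the 
data layout are hypothetical and are not taken from this guidebook. 

```python
from statistics import mean

def attrition_summary(records):
    """Split pupils by data completeness and compare pretest means.

    `records` is a list of (pretest, posttest) tuples; None marks a
    missing score. This layout is an assumption made for illustration.
    """
    complete = [(pre, post) for pre, post in records
                if pre is not None and post is not None]
    pre_only = [pre for pre, post in records
                if pre is not None and post is None]
    return {
        "n_complete": len(complete),
        "n_pretest_only": len(pre_only),
        # Impact analysis uses only pupils with both scores.
        "mean_gain": mean(post - pre for pre, post in complete),
        "pretest_mean_complete": mean(pre for pre, _ in complete),
        # The dropouts' pretest mean shows whether they differed at the start.
        "pretest_mean_dropouts": mean(pre_only) if pre_only else None,
    }

# Made-up scores: two pupils dropped out, one entered late.
records = [(30, 41), (25, 33), (18, None), (40, 47), (15, None), (None, 38)]
summary = attrition_summary(records)
print(summary)
```

In this made-up example the dropouts' pretest mean is well below that of the 
pupils who were posttested, which is the kind of difference that should be 
described in the evaluation report. 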

In short, every effort must be made to obtain pre- and posttest 
scores for each project participant. Pretest-posttest comparisons must 
be based on those students for whom both scores are available. Data 
from students having only pretest or only posttest scores must be carefully 
examined to see if they differ in some systematic respect from the 
data of students having both pre- and posttest scores. A description 
of any of these differences should be included in the project evaluation. 
Analysis of pre- and posttest scores is discussed further in Chapter VI. 

Hazard 6 



The use of non-comparable treatment and comparison groups. 



In conventional experimental designs, treatment and comparison groups must 
be comparable in all relevant variables before the treatment begins. 
Groups which differ in terms of pretest scores are an obvious source of 
bias. Other, more subtle factors such as differences in age, sex, race, 
or socioeconomic status can also exert strong biasing influences and must 
be avoided. In such designs, there is no way in which a non-comparable 
comparison group can provide an accurate estimate of how well the treatment 
group would have done without the treatment. 



Why is this a hazard? 

Students in a special program may do better or worse than comparison 
groups simply because they were different to start with. One of the 
most common cases occurs when students who volunteer are put in the 
special program while the rest serve as a comparison group. Even given 
equal pretest scores, it is likely that the volunteers are a more enthusiastic 
group and will learn more. This type of rather subtle difference 
is often overlooked. Of course, any obvious differences between treatment 
and comparison groups may also affect evaluation results, and such variables 
as socioeconomic status, age, sex, racial and ethnic composition, and 
school size and setting should be carefully checked for comparability. 

The problem is even more serious when norm-based comparisons are 
used. Volunteering or other selection procedures may result in a treat- 
ment group that is quite different from the norm-group students who got 
equal scores at pretest time. 

The net result in either case is that the comparison group provides 
an inaccurate estimate of what project participants would have learned 
without the project treatment. Theoretically, the estimate may be either 
too high or too low. However, typical selection strategies usually lead 
to superior treatment groups. 

How can the hazard be avoided? 

Students should be assigned to treatment and comparison groups on 
a random basis or in such a way that a nonrandom assignment is random 
in effect (Lord, 1967, p. 38). Essentially, this means that the two 
groups must be similar along all educationally relevant dimensions, unless 
the evaluation model specifically allows for group selection on the 
basis of pretest scores. This hazard and the steps to avoid it are closely 
related to the previously discussed Hazard 2. 

Hazard 7 

The selection of project participants based on pretest scores. 



When students are selected for project participation based on their obtaining 
relatively high or relatively low scores on some test, use of 
those scores as pretest measures invalidates any kind of norm-referenced 
evaluation. 



Why is this a hazard? 

This error has been so widely discussed and well documented that 
most evaluators are aware of the problem. Unfortunately, for various 
reasons it is still encountered. The error results from testing a large 
group of students, selecting the lowest (or highest) ones for a special 
program, and then treating the selection scores as pretest scores. This 
practice results in systematic distortions on pre- to posttest gains. 

It is well known that if the low scoring students are retested on 
the same or a comparable test, they will score higher on the average, 
while an initially high scoring group will score lower. This phenomenon 
is called "regression toward the mean," or simply "statistical regression," 
and is discussed in virtually all texts on experimental design. The result 
is that low scoring groups appear to learn more from a special program 
than they actually do, while gains in special programs for high scoring 
students may be obscured. 
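
The regression effect can be demonstrated with a small simulation. The sketch 
below is an illustration only: the score scale, error sizes, and selection 
fraction are arbitrary assumptions, and no treatment of any kind is applied 
between the two testings, so any apparent gain is purely a regression artifact. 

```python
import random
from statistics import mean

def simulate_regression(n=5000, cutoff_fraction=0.25, seed=0):
    """Pretest and retest share a true score but have independent errors.

    No treatment is applied, so any "gain" in the selected group is a
    pure regression artifact. All parameter values are illustrative.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(50, 10) for _ in range(n)]
    pretest = [t + rng.gauss(0, 5) for t in true_scores]
    retest = [t + rng.gauss(0, 5) for t in true_scores]

    # Select the lowest scorers on the pretest, as a remedial project might.
    order = sorted(range(n), key=lambda i: pretest[i])
    selected = order[: int(n * cutoff_fraction)]

    pre_mean = mean(pretest[i] for i in selected)
    post_mean = mean(retest[i] for i in selected)
    return pre_mean, post_mean

pre_mean, post_mean = simulate_regression()
print(f"selected group, selection-test mean: {pre_mean:.1f}")
print(f"selected group, retest mean:         {post_mean:.1f}")
```

The selected low scorers average higher on the retest even though nothing was 
done for them, which is why selection scores should not double as pretest scores. 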

Statistical regression presents no problem for the special and 
general regression models presented in Chapter IV. Evaluations employing 
comparison groups may or may not be affected depending on whether the 
regression effect operates differently on the two groups. Hazard 8 treats 
a closely related situation in which the comparison group is selected on 
the basis of pretest scores. Regression artifacts invalidate any kind of 
norm-referenced evaluation. 

How can the hazard be avoided? 

Corrections for the regression effect are possible in theory, but 
in practice the necessary data are not usually available. Thus, it is 
safer to avoid the problem by not using the pretest to select project 
participants except for those regression models which specifically require 
this approach. (See also Step 7, p. 23 of Tallmadge & Horst, 1974.) 

Hazard 8 

The assembling of a matched comparison group after the project participants 
are selected. 



Finding "matches" for treatment participants in some other group is a 
fundamentally unsound practice. Unless they and the treatment pupils 
are equally representative of the groups from which they are drawn, statistical 
regression will act differentially on the two groups and artificially 
inflate the apparent gains of one group with respect to the other. 



Why is this a hazard? 

It may be very useful to have a comparison group made up of students 
carefully matched to the treatment students, but unless the proper procedures 
for selection are followed, comparisons between the two groups may 
be completely misleading. The common practice of selecting students for 
the treatment, then trying to find a non-treatment student to match each 
treatment student is a serious evaluation error. If, for example, a project 
is set up for the most underachieving children in a disadvantaged 
school, it may be possible to construct a "matching" comparison group by 
finding children with equally low pretest scores in less disadvantaged 
schools. In this situation, the comparison students would be farther 
below the means of their own schools than the treatment children and their 
posttest scores would show a greater regression toward the mean. This 
regression artifact would thus inflate the apparent gains of the comparison 
group with respect to the treatment group and might obscure a real 
project impact. 

How can the hazard be avoided? 

The correct procedure for establishing matched comparison groups 
is to do the matching first and then assign members of each pair randomly 
to the treatment or the comparison group. That is, a large group of students, 
all eligible to be in the project, must be available. The first 
step is to divide the group into matched pairs based on test scores, ethnic 
background, sex, etc., so that the two members of each pair are as similar 
as possible. Then, after the matching process is complete, some random 
procedure such as flipping a coin is used to decide which student goes 
into the treatment and which into the comparison group. Where this approach 
is impossible, models which do not require matched groups should be 
selected. (See Chapter III.) 

Hazard 9 



The careless administration or scoring of tests. 



Testing must be accomplished with scrupulous attention to detail. For 
most evaluation models, the primary requirement is that treatment and 
comparison groups be tested in exactly the same way. The norm-referenced 
evaluation model further requires that procedures outlined by test pub- 
lishers be followed precisely. 



Why is this a hazard? 

Problems arise if tests are administered or scored in an inconsistent 
and careless manner. If there are differences in the ways in which 
the treatment students and the comparison students are tested or if there 
are differences in the procedures, conditions, and scoring at pretest and 
posttest times, then it is impossible for the resulting data to accurately 
reflect project impact. No amount of careful statistical analysis can 
later overcome these problems. 

How can the hazard be avoided? 

a) Test procedures must be orderly and accurate if scores are to 
be meaningful. 

b) The treatment students must be tested and scored in exactly 
the same way as comparison students. 

c) The procedures, conditions, and scoring methods during posttesting 
must be exactly the same as during pretesting. 

Properly trained personnel decrease the probability of disorderly 
or inaccurate testing procedures, but problems may be introduced by local 
conditions and student attitudes. Students may not understand what is 
expected of them, or in extreme cases, they may become unruly and make no 
serious effort to answer test questions. Problems which occur due to carelessness 
include failing to get the right name on each answer sheet, using 
the wrong answer key or conversion tables, and making mistakes in copying 
scores onto data sheets. 

The second issue, comparability between the testing situations of 
the treatment and comparison groups, can and should be dealt with in a 
straightforward manner in comparison-group designs. In these cases, iden- 
tical procedures, even the use of a single tester, are possible. In the 
more common situations in which norm-group comparisons are made, the 
instructions accompanying the test must be followed exactly. 

The third issue, comparability between pre- and posttesting situa- 
tions, requires the same attention to procedures as the other issues. The 
real problem is often the pressure on teachers to show achievement gains 
which may lead them, intentionally or unintentionally, to be stricter in 
enforcing time limits and avoiding helpful hints on the pretest than when 
administering the posttest. This type of problem can be minimized by 
having an independent, external evaluator administer the tests or by having 
teachers within a school exchange classrooms so that each tests and scores 
another teacher's students. 

Chapter V is devoted entirely to the details of obtaining accurate, 
meaningful data. 

Hazard 10 



The assumption that an achievement gain is due to the treatment when, in 
reality, it is due to some other factor. 



Other possible explanations always exist for observed gains. The plaus- 
ibility of these alternative explanations should be carefully examined 
before gains are attributed to project impact. 



Why is this a hazard? 

Sometimes project participants learn substantially more than would 
have been expected, but the project, per se, is not responsible. Instead, 
the gains could be a result of the Hawthorne effect (Whitehead, 1938) in 
which special project participants do well simply because they are getting 
special treatment. The nature of the treatment may not necessarily be 
important. An opposite result may follow from a John Henry effect 
(Saretsky, 1972). In this case, comparison-group students work extra hard 
to prove that they are just as good as project students. 

Other likely causes of misleading gains are unrecognized "treatments" 
which have nothing to do with the project. Most school systems are in a 
constant state of flux with multiple changes every year. Changes in school 
programs, personnel, facilities, class sizes, or community characteristics 
can all affect student performance. Also, the true 
source of achievement gains is sometimes improperly identified because 
children are involved in more than one treatment. Under these conditions 
it is impossible to determine causality in an unambiguous manner. 

How can the hazard be avoided? 

When a carefully implemented evaluation reveals significant cognitive 
achievement gains, it should not be immediately assumed that the 
gains are solely the result of the special treatment. A variety of other 
factors exist which may lead to the obtained results. Each plausible 
rival hypothesis should be examined and, where the evidence permits, eliminated 
as a likely explanation. A discussion of the remaining factors 
and the relative likelihood of each as a contributor to the gains should 
be included in the evaluation. In succeeding years with a continuing 
project, some of these competing explanations might be controlled and 
eliminated. 

Hazard 11 

The use of non-comparable pretests and posttests. 



It is almost always a good idea to use the same level of the same test 
for both pre- and posttesting. In norm-referenced evaluations, it is 
usually essential. 



Why is this a hazard? 

The situation in which pretests differ from posttests is frequently 
encountered in evaluation reports. Usually it occurs because there is 
a district-wide change in testing policy during the evaluation period in 
an attempt to find a more appropriate test for all district evaluations. 
The disruption of evaluations of ongoing projects is unavoidable, and 
may be completely beyond the control of the project evaluator. It may 
also, however, severely limit the usefulness of the evaluation and should 
be avoided if at all possible. The use of the same level of a test for 
both pre- and posttesting is also strongly advised. Some tests have interlocked 
levels so that scores from one test level can be converted into 
another. However, these conversion tables reflect a certain degree of 
measurement error as a result of curve fitting, rounding, and successive 
transformations. It is clearly preferable to use just one level of the 
test. 

In a comparison-group design, the fact that the posttest differs 
from the pretest may not be a critical problem. So long as pre- and posttests 
are reasonably correlated, as will be true among the major commercial 
tests, the comparison-group students make reasonably convincing conclusions 
possible. However, in the more common norm-referenced designs, 
there is no completely adequate way to compare pretest scores on one test 
with posttest scores on a completely different test. Since each test is 
normed on a different group of students, this amounts to using one comparison 
group for the pretest, and a second comparison group for the posttest. 

How can the hazard be avoided? 

To insure comparability between the pre- and posttests in norm-referenced 
evaluations, the only real solution is to administer the same 
level of the same test on both occasions. When that option is not available, 
it still may be possible, in some instances, to approximate it through 
the use of conversion tables provided in the Anchor Test Study (Loret, Seder, 
Bianchini & Vale, 1974). The Anchor Test Study provides tables which 
may be used to convert scores on one test to their equivalents on each 
of the other tests in the study. Conversion errors are reported to be 
low, so in theory the procedure is sound, but, in any case, it applies 
only to the eight most commonly used reading tests covered by the study, 
and only to grades 4, 5, and 6. 

In comparison group evaluations, switching from one standardized 
test to another is acceptable if both tests meet the requirements of this 
guidebook. The result is usually to lower pretest-posttest correlations 
and correspondingly to lower precision of the evaluation. Switching to 
an entirely dissimilar test is to be strongly discouraged. 

Hazard 12 



The use of inappropriate formulas to estimate posttest scores. 



Under certain circumstances, it makes sense to expect that a pupil will 
maintain his relative status with respect to national norms from pre- to 
posttest if he does not participate in a special project. However, many 
methods have been devised for calculating performance level expectations 
which rest on clearly untenable assumptions. These methods of estimating 
performance levels should never be used. 



Why is this a hazard? 

Many projects use an unrealistic theoretical model or formula to 
calculate "expected" posttest scores from IQ or other pretest scores. If 
students do better than the calculated expectation, the project is considered 
a success. Estimated posttest scores are often based on average 
grade-equivalent scores. For example, a student who has gained 0.7 years 
per year, on the average, since beginning school is presumed to continue 
at the same rate unless a special program increases his rate. There are 
many problems with such an estimate, but the major one is in the use of 
grade-equivalent scores (see Hazard 1). The student who averaged 0.7 
years per year over several years will usually appear to gain more than 
that if measured from fall to spring, giving a misleading impression of 
improvement. 

Most IQ-based estimates are both inaccurate and logically unreasonable. 
For example, the Bond-Tinker formula (Della-Piana, 1968, p. H) 
is often used to compute an "expected" reading level, i.e., 

Expected reading level = (IQ / 100) x (No. of years in school) + 1. 

For a student with an IQ score of 85 (approximately one standard deviation 
below the mean) at grade level 7.1 (6.1 years of school completed): 

Expected reading level = (.85) x (6.1) + 1 = 6.2 

So the formula says he should be reading at the sixth-grade level. But 
since his IQ is supposed to be "mental age" divided by "chronological age," 
his mental age would be given by: 

MA = (IQ) x (CA) 

Assuming the seventh-grader is twelve years old: 

MA = (.85) x (12) = 10 years 

We now have a twelve-year old student with a mental age of ten years who 
is expected to read as well as an average sixth grader (11 years old). 
This is certainly inconsistent, but even worse, it is incorrect. According 
to normative data from the Gates-MacGinitie reading test, a seventh-grade 
student one standard deviation below the mean is reading at the fourth- 
grade level. 
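
The arithmetic of the example above can be checked directly. The sketch below 
simply evaluates the two formulas quoted in the text; the function names are 
illustrative, and IQ is divided by 100 to match the ratio form used in the 
worked example. 

```python
def bond_tinker_expected_level(iq, years_in_school):
    """Expected reading grade level from the Bond-Tinker formula."""
    return (iq / 100) * years_in_school + 1

def mental_age(iq, chronological_age):
    """Mental age implied by the ratio definition of IQ."""
    return (iq / 100) * chronological_age

expected = bond_tinker_expected_level(iq=85, years_in_school=6.1)
ma = mental_age(iq=85, chronological_age=12)
print(round(expected, 1))  # 6.2
print(round(ma, 1))        # 10.2
```

The mental age works out to 10.2 years, which the text rounds to ten. 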

Because of these and many other theoretical and practical problems, 
the underlying concepts of the intelligence quotient have been abandoned 
by informed measurement specialists (Cronbach, 1970, p. 216; Tyler, 1972). 
While the commercial instruments which have been designed as 
"IQ tests" may have a variety of practical uses, they are not, in general, 
the best available predictors of specific school skills, and IQ scores 
are not recommended for any purpose in evaluating the effects of special 
projects. 

How can the hazard be avoided? 

In norm-referenced evaluation models, posttest scores can be estimated 
by referring to national norms. When comparison groups are used, 
the actual posttest scores of these groups, or a regression equation estimating 
the posttest scores, provide the proper basis for evaluating treatment 
effects. 



III. A PROCEDURAL GUIDE FOR MODEL SELECTION 



This section presents a procedural guide for selecting an evaluation 
model. By answering a series of questions relating to the real-world 
constraints under which the evaluation will be conducted, the reader is 
led to one of the five evaluation models presented in the following 
chapter. 

Figure 1 on page 47 summarizes the seven-step decision tree in flow-diagram 
form. Each step is discussed separately on the pages preceding 
Figure 1. (This page arrangement is intended to facilitate reference 
to the fold-out figure.) For each step, the decision question is presented 
along with two answer alternatives. A "comment" section is also 
included which explains the issue in question and the implications of 
choosing each alternative course of action. 

The specific path to be followed through the decision tree depends 
on the answers the reader makes to each of the seven questions, and instructions 
on how to proceed are provided for each answer alternative. 
The reader should first read through the chapter and then make a selection 
by skipping from page to page in accordance with these instructions. 

Figure 1 also shows the five evaluation models which are discussed 
in Chapter IV. They are arranged in decreasing order of scientific rigor, 
with those at the top of the page enabling the evaluator to draw substantially 
more conclusive inferences about project impact than those at the 
bottom. On the other hand, the feasibility of implementation is expected 
to operate in exactly the opposite direction so that the less rigorous 
models will be much easier to use. While the more rigorous models are 
certainly to be preferred, any one of the five will yield believable results 
if carefully implemented. 

Question 1 

Do practical considerations (policy, availability, cost, time) 
permit you to select an evaluation design which makes use 
of a local comparison group? 

Yes Proceed to Question 2 
No Go to Model 5, page 72 

Comment 

In order to measure the impact of any special instructional 
treatment, it is essential to have some estimate of how 
the participants would have fared under normal or non-treatment 
conditions. Since, presumably, the non-treatment 
condition consists of participation in a regular school 
curriculum, some gains would clearly be expected even without 
the special project. The problem is to obtain a good 
estimate of how large the pupils' gains would have been 
under such conditions and subtract this estimate from the 
gains they actually obtained in the special project. The 
difference is the incremental gain which can be attributed 
to project participation. 
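
The subtraction described above can be written out as a minimal sketch. It 
assumes, purely for illustration, that the non-treatment expectation is taken 
from the mean gain of a comparison group; the function name and the scores are 
hypothetical. 

```python
from statistics import mean

def incremental_gain(treatment, comparison):
    """Observed mean gain minus the estimated no-treatment mean gain.

    Each argument is a list of (pretest, posttest) pairs. Using the
    comparison group's gain as the no-treatment estimate is an
    assumption of this sketch.
    """
    observed = mean(post - pre for pre, post in treatment)
    expected = mean(post - pre for pre, post in comparison)
    return observed - expected

# Made-up scores for three pupils in each group.
treatment = [(20, 32), (25, 35), (30, 41)]   # gains of 12, 10, 11
comparison = [(21, 28), (26, 32), (29, 37)]  # gains of 7, 6, 8
gain = incremental_gain(treatment, comparison)
print(gain)
```

Here the project pupils gained 11 points on the average and the comparison 
pupils 7, so 4 points is the portion attributable to the project, provided the 
comparison group is a sound estimate of non-treatment performance. 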

There are two kinds of local comparison groups which can 
provide adequate estimates of non-treatment expectations: 
(a) a conventional comparison group which is like the treat- 
ment group in all educationally relevant respects, and (b) 
a comparison group which results from splitting a pre-existing 
intact group into treatment and comparison subgroups at 
some pretest cutoff score. 

The best method of estimating non-treatment posttest scores 
is to find a group of pupils exactly like the project 
children and to treat them in exactly the same way with 
the single exception of withholding the special treatment 
from them. Their posttest scores will then constitute the 
best possible estimate of how well the treatment group would 
have done without the treatment. 

It is often not possible to obtain a sample of exactly 
comparable pupils to serve as a comparison group. Under 
appropriate conditions, however, groups which are not 
strictly comparable can be used for estimating non-treatment 
performance. Model 3, in fact, divides a class or 
other pre-existing group into treatment and control subgroups 
at some pretest cutoff score so that all pupils 
above the cutoff go into one group while all pupils below 
it go into the other group. 

The issue of comparison group suitability and the implications 
which the type of group has for selecting an evaluation 
design are addressed in subsequent Questions. If 
either type of local comparison group is available, proceed 
to Question 2. 

Where no local comparison group is available, the evaluation 
must depend on comparisons between treatment-student 
scores and national norm-group data collected by the publishers 
of standardized tests. This procedure is explained 
in Model 5, page 72. 

Question 2 

Will pre-existing, intact groups or individual pupils be 
assigned to treatment and comparison conditions? 

Groups Skip to Question 5 
Pupils Proceed to Question 3 

Comment 

The most commonly encountered type of intact group is a 
classroom, a school, or a grade level within a school. Assignment 
by groups would mean that one third-grade classroom 
was assigned to the treatment condition and another to the 
comparison condition, or that all third graders in one 
school comprised the treatment group while all third graders 
in another school constituted the comparison group. Third 
graders from one school who were in the lowest quartile of 
the national distribution in reading could also be considered 
a pre-existing, intact group if they were compared against 
similar children from another school. In all of these 
cases, the condition to which the pupils were assigned 
was determined entirely by their group membership without 
regard to any characteristics of the individuals. 

On the other hand, if all third graders were listed alphabetically 
and alternately assigned to treatment and comparison 
conditions, we would say that assignment was by individual 
pupil. Another similar example would entail the pairing 
of children on the basis of their pretest scores with subsequent 
assignment of one member of each pair to the treatment 
group and the other to the comparison group. 

A quite different kind of assignment, but one still considered 
assignment by pupils, involves the assignment of pupils who 
score below some selected cutoff point on a test to the 
treatment group and those scoring above that point to the 
comparison group. In this case, some members of an intact 
group were assigned to the treatment condition and others 
to the comparison condition but it should be clear that 
assignment to conditions was based on considerations relating 
to individual characteristics and not group membership. 

Assignment by pupil is generally preferred over assignment 
by group as this method offers greater control over potentially 
biasing factors. The use of pre-existing groups is 
a viable alternative only where the groups are similar in 
all relevant respects to groups which would have resulted 
from assignment by pupil. 

Question 3 

Is it possible to assign pupils randomly to treatment and 
comparison groups, or will group membership be determined 
by need? 

Randomly Proceed to Question 4 
By Need Skip to Question 6 

Comment 

Random assignment implies that each child in a single "pool" 
or group has an equal chance of being assigned to the compar- 
ison or to the treatment group. One way to accomplish random 
assignment would be to place the names of all the children 
in a hat and then draw them out one at a time assigning 
every other child to the treatment group. There are other 
techniques which are equally suitable but the decision as 
to whether a child is assigned to one group or the other 
must be left purely to chance. Group assignment based on 
teacher preferences, children volunteering, or similar 
human actions is not random. To consider it so may be 
seriously misleading (see Hazard 6, p. 19). 
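
The names-in-a-hat procedure described above can be imitated with a random 
shuffle. This is a minimal sketch; the function name and the pupil names are 
hypothetical, and the optional seed exists only so that a particular draw can 
be reproduced and documented. 

```python
import random

def draw_from_hat(names, seed=None):
    """Shuffle the names, then send every other one drawn to treatment."""
    rng = random.Random(seed)
    hat = list(names)      # copy so the caller's list is left untouched
    rng.shuffle(hat)       # chance alone decides the order of drawing
    treatment = hat[0::2]  # first, third, fifth ... name drawn
    comparison = hat[1::2] # second, fourth, sixth ... name drawn
    return treatment, comparison

names = ["Ana", "Ben", "Cal", "Dee", "Eli", "Fay", "Gus"]
treatment, comparison = draw_from_hat(names, seed=1)
print(treatment)
print(comparison)
```

With an odd number of pupils the treatment group receives the extra child; 
either convention is acceptable so long as it is fixed before the draw. 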

The assumption of random assignment underlies most statistical 
tests. A statistically significant t or F test means 
simply that the observed difference between groups was larger 
than would normally be expected to result from random assignment. 
This, in turn, implies that if assignment was random 
the observed difference was probably due to the treatment. 
If assignment was not random, however, a "statistically 
significant difference," by itself, is generally meaningless. 
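
The logic of the preceding paragraph can be made concrete with a randomization 
(permutation) test. Note that this is a different technique from the t and F 
tests named in the text, substituted here because it shows the reasoning 
directly: it asks how often random reassignment alone would produce a 
difference as large as the one observed. The function name and the scores are 
made up for illustration. 

```python
import random
from statistics import mean

def permutation_p_value(treatment, comparison, n_shuffles=2000, seed=0):
    """Share of random regroupings with a difference at least as large
    (in absolute value) as the observed treatment-comparison difference."""
    rng = random.Random(seed)
    observed = mean(treatment) - mean(comparison)
    pooled = list(treatment) + list(comparison)
    n_t = len(treatment)
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # one hypothetical random assignment
        diff = mean(pooled[:n_t]) - mean(pooled[n_t:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_shuffles

# Made-up posttest scores for eight pupils per group.
treatment = [48, 52, 55, 50, 53, 57, 51, 54]
comparison = [41, 44, 40, 45, 43, 42, 46, 39]
observed, p_value = permutation_p_value(treatment, comparison)
print(observed, p_value)
```

A small proportion means that chance assignment rarely produces so large a 
difference. As the text stresses, that conclusion is only meaningful if the 
pupils were in fact assigned at random. 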

Special projects are most often designed to serve particular 
segments of the population (e.g., disadvantaged, gifted, bilingual). 
Under certain circumstances, children in such 
categories can be selected from a heterogeneous group and 
given a special treatment while the remaining children 
serve as a useful and valid comparison group. Questions 
6 and 7 describe the conditions and procedures for implementing 
evaluation models of this type. 

Question 4 

Is it possible to match pupils on the basis of pretest 
scores before randomly assigning one member of each pair 
to the treatment group and the other to the comparison group? 

Yes Go to Model 1, page 49 
No Go to Model 2, page 54 

Comment 

Random assignment usually results in some small differences 
between groups in terms of pretest performance. At least 
some of this difference can be expected to carry over to 
posttest performance. For this reason, it is desirable to 
remove these differences, however small, either by pre- 
assignment matching (see Model 1, page 49) or by statis- 
tical manipulation after the fact using analysis of co- 
variance (see Model 2, page 54). Pre-assignment matching 
is the preferred technique if feasible and has the addi- 
tional advantage of minimizing computational complexity — a 
significant drawback of covariance analysis techniques. 
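
The after-the-fact adjustment mentioned above can be illustrated with the 
standard covariance-adjusted difference between group means. This is a sketch 
of the textbook calculation, not a reproduction of the worksheets in this 
guidebook's appendices; the function name and the scores are hypothetical, and 
significance testing is omitted. 

```python
from statistics import mean

def covariance_adjusted_difference(treatment, comparison):
    """Posttest mean difference adjusted for pretest differences.

    Each argument is a list of (pretest, posttest) pairs. The adjustment
    uses the pooled within-group regression slope of posttest on pretest.
    """
    def group_sums(group):
        xs = [x for x, _ in group]
        ys = [y for _, y in group]
        mx, my = mean(xs), mean(ys)
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (y - my) for x, y in group)
        return mx, my, sxx, sxy

    mx_t, my_t, sxx_t, sxy_t = group_sums(treatment)
    mx_c, my_c, sxx_c, sxy_c = group_sums(comparison)
    slope = (sxy_t + sxy_c) / (sxx_t + sxx_c)  # pooled within-group slope
    return (my_t - my_c) - slope * (mx_t - mx_c)

# Made-up data: treatment pupils score 5 points above their pretest,
# comparison pupils score exactly at their pretest.
treatment = [(10, 15), (20, 25), (30, 35)]
comparison = [(12, 12), (22, 22), (35, 35)]
adjusted = covariance_adjusted_difference(treatment, comparison)
print(adjusted)
```

In this example the raw posttest means differ by only 2 points because the 
comparison group started 3 points higher; the adjusted difference recovers the 
5-point effect built into the data. 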

Matching must be accomplished before pupils are assigned 
to groups. The correct procedure is to identify pairs of 
students having equal or essentially equal scores on some 
test known to correlate highly with the post-treatment 
measure. One member of each pair is then assigned to either 
the treatment or the comparison group based on the outcome 
of some random event such as the flip of a coin. The remaining 
member of each pair is assigned to the other group. 
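
The match-first, assign-second procedure can be sketched as follows. The 
sketch pairs pupils who are adjacent in pretest rank order, which is one 
simple way of forming pairs with essentially equal scores; the function name 
and the pupil data are hypothetical, and a pupil left without a partner is 
simply omitted. 

```python
import random

def matched_random_assignment(pupils, seed=None):
    """Pair pupils by pretest rank, then flip a coin within each pair.

    `pupils` is a list of (name, pretest_score) tuples. Matching on the
    pretest alone is a simplification made for this illustration.
    """
    rng = random.Random(seed)
    ranked = sorted(pupils, key=lambda p: p[1])
    treatment, comparison = [], []
    # Adjacent pupils in rank order form a matched pair.
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)  # the "coin flip" comes after the matching
        treatment.append(pair[0])
        comparison.append(pair[1])
    return treatment, comparison

pupils = [("Ana", 23), ("Ben", 12), ("Cal", 36),
          ("Dee", 14), ("Eli", 35), ("Fay", 21)]
treatment, comparison = matched_random_assignment(pupils, seed=3)
print(treatment)
print(comparison)
```

Because chance decides which member of each pair is treated, neither group is 
systematically favored, which is what distinguishes this procedure from 
matching after assignment. 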

One of the most common errors in educational evaluation is 
that of matching after assignment. If, for example, there 
are two pre-existing groups, it is common to administer the 
treatment to one of them while selecting pupils with matching 
pretest scores from the other to serve as a comparison group. 
Although common, this procedure is fundamentally unsound 
and introduces systematic biases into the data. Unless 
matching can be accomplished prior to assignment it should 
not be done at all. (See Hazard 8, page 23.) 

Question 5 

Where a pre-existing comparison group is available, is it 
sufficiently similar to the treatment group so that the 
assignment of pupils to groups can be considered "random 
in effect"? 

Yes Go to Model 2, page 54 
No Skip to Question 7 

Comment 

As discussed in the Comment accompanying Question 3, 
statistical tests of the difference between the means of 
two groups generally rest on the assumption that group 
membership was determined through random assignment pro- 
cesses. It is possible, of course, for no educationally 
relevant differences to exist between two classrooms of 
third graders in a particular school, or between grade- 
level peers in two schools in a district. Under these cir- 
cumstances, the groups are virtually identical to groups 
which would have resulted from random assignment and their 
composition may be considered random in effect (Lord, 1967, 
p. 38). 

Where pre-existing, intact groups are used as treatment 
and comparison groups, it is not appropriate to assume that 
they are adequately similar. This possibility must be 
investigated empirically and the onus of proof is on the 
evaluator. Ideally, the process by which students were 
assigned to the two groups should have been effectively 
random. At the very least, the two groups must not be 
significantly different in terms of pretest scores. They 
must also be comparable in terms of socioeconomic status, 
age, sex, and racial composition. School size and setting 
(urban - rural) as well as neighborhood should also be 
comparable. Even when these factors are equated, serious 
biases are possible. Such biases are introduced when teacher 
or student participation is voluntary or when the choice 
as to which group will be the treatment group and which 
the comparison group is made by principals or teachers. 
This guidebook discourages any use of local comparison 
groups which are clearly dissimilar to the treatment group 
(see Hazard 6). 

Question 6 

Is assignment to the treatment or comparison group based 
on a cutoff value on some pre-treatment measure or 
combination of measures? 

Yes    Go to Model 3, page 59 
No     Proceed to Question 7 

Comment 

Where the memberships of the treatment and comparison groups 
are neither random nor random in effect, so-called "true" 
experimental designs can no longer be used. Under these 
circumstances "quasi-experimental" evaluation models must 
be employed. 

There are two quasi-experimental evaluation models (the 
Special Regression Models) which can provide acceptably 
conclusive evidence regarding treatment impact in 
situations where the assignment of pupils to treatment and 
comparison groups is based on need rather than randomization. 
Both of these models, however, require the establishment 
of a cutoff score above which all pupils are assigned to 
one group and below which all pupils are assigned to the 
other. Numerical ratings by teachers, classroom grades, 
and standardized achievement test scores may be used singly 
or in any desired combination, but there must be a single 
cutoff score. 

Other models exist which do not require assignment to treat- 
ment and control conditions based on a single cutoff score. 
As design requirements of this type are relaxed, however, 
additional assumptions must be made in order to attribute 
the cause of observed between-group differences to treat- 
ment influences, and credibility is thus diminished. These 
models are treated in Question 7. 

Question 7 

Is there a pre-existing comparison group whose performance 
on the pretest measure is superior to the performance of 
the treatment group? 

Yes    Go to Model 4, page 70 
No     Go to Model 5, page 72 

Comment 

Quasi-experimental designs all rest on sets of assumptions 
having varying degrees of plausibility. One such assumption 
which is relevant here and appears "safe" is that a group 
which is initially superior to another group in cognitive 
development will continue to grow at a rate equal to or 
greater than that of the initially inferior group, other 
things being equal. If, under these circumstances, the 
initially inferior group outperforms the initially superior 
group after participation in a special instructional 
treatment, it is probably safe to conclude that the treatment 
was effective. On the other hand, if an initially inferior 
group receives the treatment but fails to surpass the 
comparison group on the posttest (a typical situation) it is 
difficult to draw conclusions with confidence. Under certain 
conditions regression models not requiring single cutoff 
scores may be applicable (see Model 4). Finally, if the 
treatment was administered to the initially superior group 
and its posttest performance remained superior to the 
comparison group, it would be difficult to decide whether the 
superior posttest performance resulted from the treatment 
or simply from the inherent superiority of the treatment 
group. 

If the only available comparison group scores significantly 
lower on the pretest than the treatment group, the 
information obtainable from it is usually not worth the time and 
expense to collect. A norm-referenced evaluation model 
will probably be more useful and will certainly be less 
costly (see Model 5). 

IV. EVALUATION MODELS 

This section of the guidebook provides descriptive information 
about five evaluation models suitable for use in assessing the cognitive 
benefits resulting from local school projects. They are not necessarily 
the only models suitable for this purpose but they are recommended as the 
most convincing models that can be feasibly implemented given the 
constraints of operating school systems. 

The five models are: 

1. Posttest Comparison with Matched Groups (p. 49) 
2. Analysis of Covariance (p. 54) 
3. Special Regression (p. 59) 
4. Generalized Regression (p. 70) 
5. Norm-referenced (p. 72) 

Each of these models is described in terms of general characteristics, 
strengths and weaknesses, and considerations related to its implementation. 
Except where computational procedures are excessively complex and require 
the skills of a sophisticated statistician (the Generalized Regression 
Model), step-by-step procedures are provided for using each of the models. 
References to sources of more detailed information are also included. 

Each of the evaluation models in this section has specific analysis 
requirements. However, several preliminary steps are useful with any 
evaluation model. These preliminary steps are discussed in Chapter VI. 

Model 1 

Posttest Comparison with Matched Groups 

Summary 

General Characteristics. This model requires that children be paired 
in terms of pretest measures and that one member of each pair be randomly 
assigned to the treatment group and the other to the comparison group. 

Strengths. The matched groups evaluation model provides what is 
theoretically the most accurate estimate of how the treatment group would 
have done had they not received the special instructional treatment. This 
high degree of accuracy is due to the fact that the comparison group is 
constructed so as to be virtually identical to the treatment group at 
pretest time. Thus, if the experiences of the two groups are the same between 
pre- and posttest with the single exception of exposure to the treatment, 
the comparison group should achieve posttest scores which are essentially 
the same as those which would have been achieved by the treatment group had 
its members not received the treatment. 

Weaknesses. The manner of assigning pupils to treatment and control 
groups employed in this model may produce a greater awareness of group 
membership than other, less obtrusive assignment procedures. Children in the 
comparison group may realize that their group is not inherently different 
from the treatment group, yet, for some reason, the other group of children 
is receiving special attention. This increased awareness of group 
membership may magnify such spurious influences as the Hawthorne effect in the 
treatment group or the John Henry effect in the comparison group (see 
Hazard 10). 

Implementation Considerations. This evaluation model allows a wide 
choice of test instruments and testing times. However, if, as would be 
recommended, a norm-referenced comparison is also employed, the choice of 
tests and testing times becomes more restricted (see the Norm-referenced 
Model, page 72). 

According to this model, children in the treatment and comparison 
groups are matched on the basis of their pretest scores and possibly other 
educationally relevant variables as well. At posttest time, it is 
important to use an instrument that measures the same skills as the pretest. If 
children in a math project are matched on the basis of pretest reading 
test scores or IQ and then are given an arithmetic posttest, the increase 
in precision which can be achieved through matching may be substantially 
reduced. This is because the precision gained is proportional to the 
correlation between pre- and posttest scores. 

In order to implement the model, it must be possible to (a) assemble 
a group of children large enough in number to form both a treatment and a 
comparison group, (b) pretest the entire group, (c) pair children on the 
basis of their pretest scores, and (d) randomly assign one member of each 
pair to the treatment group and the other to the comparison group. If 
eligibility for participation in the treatment group is based on some 
special educational need, this procedure is clearly not feasible and one 
of the other evaluation models should be implemented. 

Implementation Procedures. 

Step One: Identify a group of potential participants large enough 
in number to form both a treatment and a comparison group. 

Step Two: Administer the pretest to the entire group with an 
instrument known to correlate highly with the measure selected 
for use as the posttest. 

Step Three: Score the pretest. Using their raw scores, identify 
pairs of children with identical or nearly identical test scores. 
Note: Unless pairings are based on identical scores, there is a 
possibility that the mean pretest scores of the treatment and 
comparison groups may differ by amounts large enough to influence 
the evaluation outcome. If such differences are found, covariance 
analysis should be used to adjust for them (see Model 2). 

Step Four: Once children are paired, randomly assign one member 
of each pair to the treatment group and the other to the comparison 
group. Randomization may be done by flipping a coin, using a 
table of random numbers, or any other procedure based on chance 
rather than choice. 

Step Five: Once the groups are formed, it is important to monitor 
their experiences over the treatment period. The experiences of 
the two groups should be identical with the single exception that 
one group gets the treatment and the other does not. Where this 
is not the case, the differences between groups in posttest 
performance may not be the result of the treatment, but rather a result 
of uncontrolled attitudinal or experiential factors (see Hazard 10). 

Step Six: Administer the posttest. If at all possible, the two 
groups should be tested at the same time. Large differences in 
testing times allow potentially relevant experiences to occur for 
one group and not the other. Even small differences such as the 
time of day, the weather, the emotional climate and other 
difficult-to-assess influences may alter test performances. 

Step Seven: Score the posttest. Raw scores should be converted 
to their standard or scale score equivalents before any computations 
are undertaken. If a test scoring service is used, it should be 
made clear that each raw score should be converted to its standard 
or scale score equivalent. 

Step Eight: Compute the following summary statistics by obtaining 
the indicated formulas from any elementary statistics book. 

(a) The mean and standard deviation of posttest scores for the 
treatment group. 

(b) The mean and standard deviation of posttest scores for the 
comparison group. 

(c) The correlation between groups based on the original pairing 
of children. 

If one member of a pair is lost, i.e., no posttest score 
is available, the other member must be excluded from all of these 
computations. See Chapter VI for further analysis considerations. 



Step Nine: Compare the mean posttest scores of the treatment and 
comparison groups. If the treatment group score is greater than 
the comparison group score the project may have been effective. 
The statistical significance of the difference should be checked 
using the following formula: 


$$ t = \frac{\bar{Y}_t - \bar{Y}_c}{\sqrt{\dfrac{s_t^2 + s_c^2 - 2\, r_{tc}\, s_t s_c}{N - 1}}} $$

where   $\bar{Y}_t$ = posttest mean standard score of the treatment 
                      group 

        $\bar{Y}_c$ = posttest mean standard score of the comparison 
                      group 

        $s_t$ = standard deviation of the treatment group 
                posttest scores 

        $s_c$ = standard deviation of the comparison group 
                posttest scores 

        $r_{tc}$ = correlation between posttest scores of the two 
                   groups 

        $N$ = number of pairs of children 

Degrees of freedom = $N - 1$ 


The one-tailed probability of the computed t can be found in the 
tables provided in most standard statistical texts. If it is less 
than or equal to .05 (p ≤ .05), the special project may be said 
to have produced statistically significant achievement gains. 
There is no generally accepted criterion for deciding whether 
the size of the gain is large enough to be considered educationally 
significant. Where standardized tests are used, the standard 
deviation of the national norm group (σ) provides a useful reference. 
As a rule of thumb, the authors suggest one-third of a standard 
deviation above expectation based on posttest scores as a reasonable 
cutoff value. In other words, if 

$$ \bar{Y}_t - \bar{Y}_c \ge \sigma / 3 $$

the gain may be considered educationally significant. 
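
Steps Eight and Nine can be sketched in Python as follows. This is a minimal illustration, not the guidebook's worksheet: the function names are hypothetical, and it assumes the standard deviations and the correlation are computed with N in the denominator, in which case the divisor under the square root is N - 1. Under that assumption the result equals the usual t computed directly from the within-pair differences.

```python
import math
from statistics import mean, pstdev


def matched_pairs_t(treatment_post, comparison_post):
    """Return (t, degrees of freedom) for matched pairs from summary statistics.

    The two lists must be in pair order: element i of each list belongs to pair i.
    Pairs with a missing posttest score must already have been dropped.
    """
    n = len(treatment_post)
    if n != len(comparison_post) or n < 2:
        raise ValueError("need the same number (at least 2) of complete pairs")
    y_t, y_c = mean(treatment_post), mean(comparison_post)
    # Standard deviations with N in the denominator (an assumption, see lead-in).
    s_t, s_c = pstdev(treatment_post), pstdev(comparison_post)
    # Pearson correlation between the paired posttest scores.
    cov = sum((a - y_t) * (b - y_c)
              for a, b in zip(treatment_post, comparison_post)) / n
    r_tc = cov / (s_t * s_c)
    std_error = math.sqrt((s_t**2 + s_c**2 - 2 * r_tc * s_t * s_c) / (n - 1))
    return (y_t - y_c) / std_error, n - 1


def educationally_significant(mean_t, mean_c, norm_sd):
    """Rule of thumb: the gain is at least one-third of the norm-group SD."""
    return (mean_t - mean_c) >= norm_sd / 3
```

The t value returned here still has to be compared with a one-tailed critical value from a t table for the returned degrees of freedom.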




Model 2 
Analysis of Covariance 



Summary 

General Characteristics. This model is appropriate to use where 
individual pupils are randomly assigned to treatment and comparison groups 
or where pre-existing, intact groups which are sufficiently similar to 
be considered random samples from a single population are assigned to treat- 
ment and comparison conditions. Analysis of covariance provides an appro- 
priate statistical adjustment to compensate for pretest score differences 
between groups if these differences were due to such chance factors as 
random sampling fluctuations. If pretest differences are real, i.e., the 
treatment and comparison groups cannot be regarded as random samples from 
a single population, covariance analysis systematically underadjusts for 
the initial differences between groups. The underadjustment spuriously 
reduces the probability that the initially inferior group will be found 
superior on posttest performance. Conversely, it spuriously increases 
the probability that the initially superior group will be found superior 
on posttest performance. 

Strengths. When an unmatched control group is used, analysis of 
covariance provides the best method of adjusting observed posttest scores 
for random pretest differences. Comparing posttest means that have been 
adjusted is always more precise than comparing unadjusted posttest scores. 

Weaknesses. This model assumes that treatment and comparison students 
are random samples from a single population so that any difference in 
pretest performance is due only to sampling error and random error of 
measurement. It will not provide an appropriate adjustment for pretest score 
differences which reflect non-random differences between groups (see 
Hazard 6, page 19). Where analysis of covariance is employed with data 
from pre-existing, intact groups, there is always some danger in presuming 
that the groups are random samples from a single population. 

Implementation Considerations. This evaluation model allows a wide 
choice in the test instruments to be used and in the time of testing. If, 
as would be recommended, a norm-referenced comparison is also made, the 
choices become more restricted (see the Norm-referenced Model, page 72). 

The control group must be very similar to the treatment group. 
True random selection is strongly advised. If the groups were not selected 
randomly, strong evidence is needed to demonstrate that the selection was 
"random in effect" (see Chapter III, Question 5). 

This model involves extensive computations and, unless they can be 
done at little cost or effort on a computer, a decision should be made as 
to whether the analysis is justified. The degree of precision gained by 
employing analysis of covariance depends in part on the correlation between 
the pretest and the posttest. If the correlation is relatively low, the 
adjusted values would not differ very much from the unadjusted values; if 
high, then the posttest means would be adjusted by a correspondingly high 
proportion of the original pretest difference. Pre- and posttest measures, 
consequently, should be selected to maximize the correlation between them. 
Multiple covariates may be used to achieve this objective. 
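
How the size of the adjustment depends on the pretest-posttest relationship can be seen in a short Python sketch. It uses the standard textbook single-covariate adjustment (posttest difference minus the pooled within-group slope times the pretest difference); it is not a transcription of the Appendix B worksheets, the function name is hypothetical, and no significance test is included.

```python
from statistics import mean


def adjusted_posttest_difference(x_t, y_t, x_c, y_c):
    """Return (unadjusted difference, adjusted difference, pooled slope).

    x_* are pretest scores and y_* are posttest scores for the treatment (t)
    and comparison (c) groups, one entry per pupil.
    """
    def sums(x, y):
        mx, my = mean(x), mean(y)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        return mx, my, sxy, sxx

    mx_t, my_t, sxy_t, sxx_t = sums(x_t, y_t)
    mx_c, my_c, sxy_c, sxx_c = sums(x_c, y_c)
    # Pooled within-group regression slope of posttest on pretest.
    b_w = (sxy_t + sxy_c) / (sxx_t + sxx_c)
    unadjusted = my_t - my_c
    # Remove the part of the posttest gap that the pretest gap predicts.
    adjusted = unadjusted - b_w * (mx_t - mx_c)
    return unadjusted, adjusted, b_w
```

When the pooled slope is near zero (a weak pretest-posttest relationship) the adjusted and unadjusted differences are nearly the same, which is the situation in which the extra computation buys little.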

Implementation Procedures. 

Step One: Form the treatment and comparison groups. Assignment 
to groups should be based on a random procedure such as drawing 
well-shuffled names from a hat. In some cases, intact classrooms 
may represent a reasonable approximation to randomly selected 
groups. Groups differing systematically on ethnicity, SES, sex 
or other obvious variables are never satisfactory. Similarly, 
a non-volunteer group can never serve as a comparison group for 
volunteers. 

Step Two: Administer and score the pretest. Testing conditions 
must be exactly alike for the treatment and comparison groups. 
Testing both groups together may be a good idea unless one group 
of students is put at a relative disadvantage, e.g., by being 
tested in unfamiliar surroundings. 

Step Three: At the end of the project, administer and score the 
posttest. Once again, testing conditions for the two groups should 
be exactly alike. Raw scores should be converted to their standard 
or scale score equivalents before any computations are undertaken. 
If a test scoring service is used, it should be made clear that 
each raw score should be converted to its standard or scale score 
equivalent. 

Step Four: If there is no difference between the groups on the 
pretest, analysis of covariance is not needed. In this case, a 
simple t test for independent groups is appropriate for testing the 
posttest difference: 


$$ t = \frac{\bar{Y}_t - \bar{Y}_c}{\sqrt{\dfrac{N_t s_t^2 + N_c s_c^2}{N_t + N_c - 2}}\;\sqrt{\dfrac{N_t + N_c}{N_t N_c}}} $$

where   $\bar{Y}_t$ = mean standard score of the treatment group 
                      on the posttest 

        $\bar{Y}_c$ = mean standard score of the comparison group 
                      on the posttest 

        $s_t$ = standard deviation of the treatment group 
                posttest scores 

        $s_c$ = standard deviation of the comparison group 
                posttest scores 

        $N_t$ = number of treatment group pupils 

        $N_c$ = number of comparison group pupils 

Degrees of freedom = $(N_t + N_c - 2)$ 



The one-tailed probability of the computed t can be found in the 
tables provided in most standard statistical texts. If it is less 
than or equal to .05 (p ≤ .05), the project may be said to have 
produced statistically significant achievement gains. 

Step Five: Assuming the groups differ in mean pretest scores, 
an analysis of covariance is recommended. McNemar (1969, Ch. 18) 
provides a readable explanation of the model. A more complete 
development is available in Winer (1971, Ch. 10). Because of the 
amount of computation involved, the use of a computer is highly 
desirable. Appropriate programs can be provided by most computer 
centers. Where the amount of data is small or computer facilities 
are unavailable, the calculations can be done by hand. Instructions 
for carrying out the analysis of covariance and a set of worksheets 
for simplifying the computational work are included in Appendix B. 
These worksheets are referenced directly to the numerical example 
in Winer (1971) and preserve his notation, but are revised for the 
case of two groups (treatment plus comparison). Since the textbook 
examples are for three groups, they are not directly applicable to 
the typical project evaluation. 

Before undertaking a hand-calculated analysis of covariance it is 
advisable to do a quick check to see whether the effort is justified. 
Analysis of covariance is essentially the same as the above t test, 
but with the posttest difference ($\bar{Y}_t - \bar{Y}_c$) adjusted to take into 
account differences between the groups at pretest time ($\bar{X}_t - \bar{X}_c$). 
If the correlation between pretest and posttest is 1.0 the entire 
difference ($\bar{X}_t - \bar{X}_c$) is added to the posttest score of the group 
which was lower at pretest time. This is the maximum possible 
adjustment. Since, in practice, the correlation will be less than 
one, the adjustment will be somewhat smaller. To check whether 
the adjustment is likely to affect conclusions: (a) test the 
unadjusted posttest difference ($\bar{Y}_t - \bar{Y}_c$) using a t test, and (b) test 
the posttest difference with the maximum adjustment using a t 
test. If both t tests are significant, then analysis of covariance 
will also be significant and need not be computed. If both are 
non-significant, analysis of covariance will also be non-significant 
and need not be computed. It is only necessary to carry out the 
analysis of covariance if one t test is significant and the other 
is not. The two t tests are as follows: 

(a) No adjustment: 

Use the above formula for t exactly as written. The 
no-adjustment numerator is 

$$ \bar{Y}_t - \bar{Y}_c $$

(b) Maximum adjustment: 

Use the above formula for t but change the numerator to 

$$ (\bar{Y}_t - \bar{Y}_c) - (\bar{X}_t - \bar{X}_c) $$

where   $\bar{X}_t$ = mean standard score of the treatment 
                      group on the pretest 

        $\bar{X}_c$ = mean standard score of the comparison 
                      group on the pretest 



Step Six: Instructions for determining the level of statistical 
significance for analysis of covariance are included in Appendix B. 
However, there is no generally accepted criterion for deciding 
whether the size of the gain is large enough to be considered 
educationally significant. Where standardized tests are used, the 
standard deviation of the national norm group (σ) provides a useful 
reference. As a rule of thumb, the authors suggest one-third of a 
standard deviation above expectation based on adjusted posttest 
scores as a reasonable cutoff value. In other words, if 

$$ \bar{Y}_t - \bar{Y}_c \ge \sigma / 3 $$

the gain may be considered educationally significant. 
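
The quick check in Step Five (one t test with no adjustment, one with the maximum adjustment) can be sketched in Python as below. The function and parameter names are hypothetical. The denominator assumes the independent-groups form from Step Four, with standard deviations computed with N in the denominator, and the one-tailed critical value is assumed to be looked up by the user in a t table for N_t + N_c - 2 degrees of freedom.

```python
import math


def independent_t(numerator, n_t, s_t, n_c, s_c):
    """t for independent groups, given a numerator and posttest SDs."""
    pooled = math.sqrt((n_t * s_t**2 + n_c * s_c**2) / (n_t + n_c - 2))
    return numerator / (pooled * math.sqrt((n_t + n_c) / (n_t * n_c)))


def covariance_needed(y_t, y_c, x_t, x_c, n_t, s_t, n_c, s_c, critical_t):
    """Return (t with no adjustment, t with maximum adjustment, needed?).

    y_* are posttest means, x_* are pretest means. Analysis of covariance is
    worth computing only when the two tests disagree about significance.
    """
    t_none = independent_t(y_t - y_c, n_t, s_t, n_c, s_c)
    t_max = independent_t((y_t - y_c) - (x_t - x_c), n_t, s_t, n_c, s_c)
    significant_none = t_none >= critical_t  # one-tailed test
    significant_max = t_max >= critical_t
    return t_none, t_max, significant_none != significant_max
```

If the third returned value is false, both tests agree and the full hand calculation can be skipped, as the text explains.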
Model 3 
Special Regression Models 

Summary 

General Characteristics. Two special regression models are 
considered here, the Regression Projection Model (Tallmadge & Horst, 1974) 
and the Regression-discontinuity Model (Campbell & Stanley, 1963). In 
both models, the selection of treatment participants is determined on the 
basis of performance on the pretest. All pupils in a group are pretested 
and those who score above or below a particular score are assigned to the 
treatment group while the remaining pupils serve as a comparison group. 

Strengths. Both models make use of an identifiable and definable 
comparison group. This group offers a sounder basis for establishing no- 
treatment posttest expectations than national norms since the comparability 
of the experiences of the two groups over the pre-to-posttest interval can 
be empirically verified. The use of a sharp cutoff score in these models 
simplifies the interpretation of significant results as compared to re- 
gression models which do not require this type of assignment to groups. 

Weaknesses. The Regression Projection Model tests the difference 
between the observed and expected posttest means of the treatment group 
where the "expectation" is derived from the comparison group regression 
line. The validity of conclusions based on this model rests on the 
assumption that the combined-group regression line would be linear over its entire 
range under no-treatment conditions, an assumption which is not always 
justified. 

The Regression-discontinuity Model tests the difference between the 
intercepts of the treatment and comparison groups' regression lines with 
the line representing the pretest cutoff score. In its simplest form this 
model involves the same assumption of linear regression as does the 
Regression Projection Model, but by using higher-order regression equations 
(curved regression lines) the problem can be eliminated (Sween, 1971). A 
remaining weakness is that where treatment impact is inversely proportional 
to pretest scores (i.e., the lowest scoring students make the biggest gains), 
there may be no difference in regression line intercepts even where the mean 
gain of the treatment group is highly significant. 

Implementation Considerations. Figure 2 on the following page 
illustrates both the Regression Projection and the Regression-discontinuity 
Models. In this idealized conception, the solid-line portion of the ellipse 
to the right of the cutoff score represents the actual distribution (scatter 
plot) of the pre- and posttest scores of the comparison group. It is used 
to estimate what the score distribution for the treatment group would have 
been if there had been no special treatment. This no-treatment expectation 
is illustrated by the broken-line portion of the ellipse to the left of the 
cutoff score. The actual distribution of the treatment group's scores is 
illustrated by the solid-line portion of the ellipse to the left of the 
cutoff score. This distribution is displaced upward above the no-treatment 
expected scores indicating that the treatment did have the effect of raising 
posttest scores. 

Regression lines are drawn diagonally through the distributions 
shown in Figure 2. As mentioned above, the Regression Projection Model 
involves testing the difference between the observed and the expected mean 
posttest scores while the Regression-discontinuity Model involves testing 
the difference between the intercepts of the regression lines with the cutoff 
score. In the situation shown in Figure 2, regression is linear and the 
amount of treatment impact was independent of the pupils' pretest scores. 
Under these conditions, the difference between means is identical to the 
difference between intercepts and the two evaluation models should yield 
identical results. 
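
The two effect estimates can be compared with a short Python sketch. It assumes simple straight-line least-squares fits within each group and covers only the point estimates, not their significance tests; all function names are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares slope and intercept (value at X = 0) of Y on X."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx


def regression_projection_effect(x_t, y_t, x_c, y_c):
    """Observed treatment mean minus the mean projected from the comparison line."""
    b_c, a_c = fit_line(x_c, y_c)
    expected = a_c + b_c * mean(x_t)
    return mean(y_t) - expected


def regression_discontinuity_effect(x_t, y_t, x_c, y_c, cutoff):
    """Gap between the two groups' regression lines at the cutoff score."""
    b_t, a_t = fit_line(x_t, y_t)
    b_c, a_c = fit_line(x_c, y_c)
    return (a_t + b_t * cutoff) - (a_c + b_c * cutoff)
```

With linear regression and a treatment effect that is the same at every pretest score, both functions return the same value, which is the Figure 2 situation described above.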

Figure 3 depicts a situation in which the treatment had its greatest 
impact on pupils farthest below the cutoff score and a negligible effect on 
pupils right at the pretest cutoff. Under these circumstances, the slope 
of the treatment group regression line is flatter than that of the comparison 
group. There is no difference between the intercepts of the regression lines 
with the cutoff score, but there is a difference between the expected and 
observed mean posttest scores of the treatment group. While this difference 
would have been detected by the Regression Projection Model and not by the 
Regression-discontinuity Model, other kinds of treatment impact variations 
would be more readily detected by the Regression-discontinuity Model. 
Because its assumptions are less subject to question, the latter model may 
also be considered more conclusive, especially in its general form (Sween, 
1971). 

It is not possible to provide general decision criteria as to which 
of the two models is the more appropriate for a particular situation. 
Plotting the scatter diagram, however, should provide some insight as to 
what kinds of influences are operating and, consequently, which of the two 
models should be used. Knowledge about the treatment may also help. If, 
for example, a particular project provides remedial instruction in 
proportion to individual students' needs, it would be more appropriate to expect 
the kind of impact illustrated in Figure 3 than that shown in Figure 2. 
In this instance, the Regression Projection Model would be the proper choice. 

The utility of both special regression models is proportional to 
the size of the correlation between pre- and posttest scores. The 
relationship between the size of this correlation and the validity of inferences 
which can be drawn from implementation of the models is precisely analogous 
to the same relationship in Models 1 and 2. 

Using test scores as the sole determinant of pupils' needs for special 
instructional treatment is a practice some educators consider unacceptable. 
This objection can be resolved by using a composite measure made up, for 
example, of a pretest score and an independently-made, numerical, teacher 
rating of need. 

One additional point is relevant. The entire discussion of the 
Regression Projection Model has assumed that the comparison group regression 
equation would be used to estimate how the treatment group would have 
performed had they not received the treatment. It is equally 
possible to use the treatment group regression line to estimate how 
the comparison group would have performed if they had received the 
treatment. When the treatment affects the slope of the regression 
line in the manner shown in Figure 3, however, this practice would 
lead to the erroneous conclusion that the treatment had a negative impact. 

Implementation Procedures. 

Step One: To implement either special regression model, 
administer and score the pretest. The test should be given to all 
members of a group from which the treatment pupils are to be 
drawn because of their special needs. The pretest should 
correlate substantially with the posttest measure. 

Step Two: If desired, generate a composite score which 
incorporates the pretest measure and any other measure such as 
independently-made teacher ratings. 

In generating a composite score, each score element should be 
weighted in proportion to the standard deviation of scores on that 
element. For example, a set of pretest scores has a mean of 20 
and a standard deviation of 3. These are to be combined with teacher 
ratings having a mean of 7 and a standard deviation of 5. The 
pretest scores will thus account for 3/8 of the composite score 
variability while teacher ratings will account for 5/8 of it. In order 
to give the two measures equal importance in the composite score, 
each test score would have to be multiplied by 5/3 to equalize 
the standard deviations of the two measures. Then the weighted 
values would simply be summed. Composites involving more than 
two measures can be constructed in a similar manner although it 
seems unlikely that a composite involving more than two elements 
would often be required. 

If composite scores are used, it must be remembered that they 
then become the pretest measure. All future calculations 
involving "the pretest measure" must use the composite measure, not 
one of its elements. 
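
The weighting described in Step Two can be sketched in Python as follows. The function name and the list inputs are hypothetical; the sketch rescales the pretest scores so that their standard deviation equals that of the teacher ratings and then sums the two elements, as in the 5/3 example.

```python
from statistics import pstdev


def composite_scores(pretest, ratings):
    """Give pretest scores and teacher ratings equal weight, then sum them.

    pretest and ratings are parallel lists, one entry per pupil.
    """
    # Ratio of standard deviations, e.g. 5/3 in the guidebook's example.
    # The same pupils appear in both lists, so the choice between N and N - 1
    # in the standard deviation does not change this ratio.
    weight = pstdev(ratings) / pstdev(pretest)
    return [weight * p + r for p, r in zip(pretest, ratings)]
```

Once computed, the returned values take the place of the pretest score in every later calculation.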

Step Three: Establish a single cutoff score. For a remedial 
project, assign all pupils scoring below this value to the 
treatment group. Alternately, students scoring above a cutoff score 
might be assigned to a special project for the gifted. One 
convenient way to establish a cutoff score is to determine how many 
pupils can be served by the special project and then count up or 
down from the lowest or the highest score until the quota is filled. 
Once the cutoff is established, it must be adhered to strictly. 
There can be no exceptions made in the assignment of each grade 
level if more than one grade level will be involved in the 
special project. 
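
Counting up from the lowest score until the quota is filled can be sketched in Python for the remedial case. The function name is hypothetical, and the handling of tied scores is an assumption of this sketch: pupils tied at the cutoff all go to the comparison group, so the rule "everyone below the cutoff is treated" has no exceptions, even if that leaves the project slightly under quota.

```python
def remedial_cutoff(scores, quota):
    """Return (cutoff, treatment scores, comparison scores) for a remedial project.

    scores: (composite) pretest scores, one per pupil.
    quota: the number of pupils the project can serve.
    """
    if not 0 < quota < len(scores):
        raise ValueError("quota must leave pupils on both sides of the cutoff")
    ranked = sorted(scores)
    # The first score that is NOT served becomes the cutoff.
    cutoff = ranked[quota]
    treatment = [s for s in scores if s < cutoff]
    comparison = [s for s in scores if s >= cutoff]
    return cutoff, treatment, comparison
```

For a project for the gifted the same idea applies in mirror image, counting down from the highest score.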

Step Four: Administer and score the posttest. All available 
pupils in the original group must be posttested even though only 
a relatively small proportion of them may have participated in 
the treatment. The subsequent analyses can be performed using 
raw scores although it would be preferable to convert both pre- 
and posttest scores to their standard or scale score equivalents 
if standardized tests are used. 

Step Five: To carry out the computations for either the Regression
Projection Model or the simplest version of the Regression-discontinuity
Model, calculate the following values:

                                         Treatment      Comparison
                                         Group          Group

Number of pupils                         $N_t$          $N_c$

Mean of (composite) pretest scores       $\bar{X}_t$    $\bar{X}_c$

Standard deviation of (composite)
  pretest scores                         $s_{X_t}$      $s_{X_c}$

Mean of posttest scores                  $\bar{Y}_t$    $\bar{Y}_c$

Standard deviation of posttest
  scores                                 $s_{Y_t}$      $s_{Y_c}$

Correlation of posttest with
  (composite) pretest                    $r_t$          $r_c$

Slope of the regression line for
  predicting Y from X                    $b_t$          $b_c$

Programs are readily available for all computers and programmable
calculators to assist in these calculations. The names or descriptions
of appropriate programs usually specify that they compute Pearson
product-moment correlations and, in general, all of the above values
will be printed out automatically.

If no computational facilities are available, the calculations may be
done by hand. Computational formulas and instructions may be found in
any introductory statistics book. It will simplify the task to recall
that

$$ b = r \, \frac{s_Y}{s_X} $$

Once the above values have been calculated, the remaining computations
are relatively simple.
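
Where a general-purpose computer is available, the summary values
listed in this step can be obtained with a short routine such as the
following sketch. The function name and data are hypothetical, and the
standard deviations are assumed to use N rather than N - 1 as the
divisor; the correlation and slope do not depend on that choice.

```python
import math

def summarize(x, y):
    """Return N, means, SDs, correlation, and slope for predicting y from x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    syy = sum((v - mean_y) ** 2 for v in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return {
        "n": n,
        "mean_x": mean_x,
        "mean_y": mean_y,
        "sd_x": math.sqrt(sxx / n),      # assumption: divisor N
        "sd_y": math.sqrt(syy / n),
        "r": sxy / math.sqrt(sxx * syy),  # Pearson product-moment correlation
        "slope": sxy / sxx,               # equals r * sd_y / sd_x
    }

# Hypothetical pretest (x) and posttest (y) scores for one group.
stats = summarize([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```

The routine would be run once for the treatment group and once for the
comparison group.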

Step Six A: Regression Projection Model.

In the Regression Projection Model, the actual mean posttest score of
the treatment group ($\bar{Y}_t$) is compared with an estimated
no-treatment value ($\hat{Y}_t$) obtained by projecting the comparison
group regression line.

This predicted value is calculated by the following formula:

$$ \hat{Y}_t = \bar{Y}_c + b_c (\bar{X}_t - \bar{X}_c) $$

The amount of the treatment effect is the difference between the actual
and the estimated mean scores, or:

$$ \bar{Y}_t - \hat{Y}_t $$
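
The projection and the resulting treatment effect can be sketched as
follows. The function names and the numerical values are hypothetical;
the arithmetic is simply the projection formula above applied to the
Step Five summary values.

```python
def projected_no_treatment_mean(mean_y_c, slope_c, mean_x_t, mean_x_c):
    """Project the comparison group regression line to the treatment
    group's mean pretest score."""
    return mean_y_c + slope_c * (mean_x_t - mean_x_c)

def treatment_effect(mean_y_t, mean_y_c, slope_c, mean_x_t, mean_x_c):
    """Actual treatment group posttest mean minus the projected value."""
    expected = projected_no_treatment_mean(mean_y_c, slope_c, mean_x_t, mean_x_c)
    return mean_y_t - expected

# Hypothetical summary values for a remedial project.
expected = projected_no_treatment_mean(mean_y_c=50, slope_c=0.8,
                                       mean_x_t=30, mean_x_c=45)
effect = treatment_effect(mean_y_t=42, mean_y_c=50, slope_c=0.8,
                          mean_x_t=30, mean_x_c=45)
```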

The statistical significance of this difference may be tested using the
following formula (Tallmadge & Horst, 1974):

[Formula illegible in this copy. The legible fragments show a t
statistic with $N_t + N_c - 3$ degrees of freedom whose numerator
contains $(\bar{Y}_t - \hat{Y}_t)^2 (N_t + N_c - 3)$.]

Where:

$$ P_t = \frac{N_t}{N_t + N_c}, \qquad P_c = \frac{N_c}{N_t + N_c}, \qquad b = P_t b_t + P_c b_c $$

Step Six B: Regression-discontinuity Model.

The simplest form of the Regression-discontinuity Model consists of
fitting straight regression lines independently to the treatment and
comparison groups, then testing the difference between the two lines at
the point where they intersect the pretest cutoff score value.

Let:  $K$ = the (composite) pretest cutoff score

      $\hat{Y}_t$ = the Y value of the treatment group regression line
                    for a (composite) pretest score of K

      $\hat{Y}_c$ = the Y value of the comparison group regression line
                    for a (composite) pretest score of K

Then:

$$ \hat{Y}_t = \bar{Y}_t + b_t (K - \bar{X}_t) $$

$$ \hat{Y}_c = \bar{Y}_c + b_c (K - \bar{X}_c) $$



Unlike the Regression Projection Model, in which a treatment effect was
calculated, there is no special interpretation of the value
$\hat{Y}_t - \hat{Y}_c$ unless the regression lines have equal slopes,
in which case it is a treatment effect. However, if this value is
significantly greater than zero, it is evidence of a real treatment
effect. The statistical significance of the difference may be tested
using the following formula (Sween, 1971):

[Formula illegible in this copy. The legible fragments show a t
statistic with $N_t + N_c - 4$ degrees of freedom whose numerator
contains $(\hat{Y}_t - \hat{Y}_c)^2 (N_t + N_c - 4)$ and whose
denominator contains $\frac{N_t + N_c}{N_t N_c} + Z_t + Z_c$, where the
definitions of $Z_t$ and $Z_c$ involve $N_t s_{X_t}^2$ and
$N_c s_{X_c}^2$.]

The one-tailed probability of the computed t value can be found in the
tables provided in most standard statistical texts. The subscripts for
t ($N_t + N_c - 3$, $N_t + N_c - 4$) are the appropriate numbers to use
for the "degrees of freedom" column in the table. If p is less than or
equal to .05 ($p \le .05$), the special project can be said to have
produced statistically significant achievement gains.
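
For the Regression-discontinuity Model, the comparison of the two lines
at the cutoff can be sketched as below. The test shown is the standard
textbook t test for the difference between two independently fitted
regression lines at a given point, using a pooled residual variance
with N_t + N_c - 4 degrees of freedom. It is assumed, not verified, to
be equivalent to the formula cited from Sween (1971). All function
names and data are hypothetical.

```python
import math

def fit_line(x, y):
    """Least-squares line through one group's (pretest, posttest) pairs."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    sse = sum((b - (my + slope * (a - mx))) ** 2 for a, b in zip(x, y))
    return n, mx, my, sxx, slope, sse

def discontinuity_t(x_t, y_t, x_c, y_c, cutoff):
    """Difference between the two lines at the cutoff, its t value, and df."""
    n_t, mx_t, my_t, sxx_t, b_t, sse_t = fit_line(x_t, y_t)
    n_c, mx_c, my_c, sxx_c, b_c, sse_c = fit_line(x_c, y_c)
    yhat_t = my_t + b_t * (cutoff - mx_t)
    yhat_c = my_c + b_c * (cutoff - mx_c)
    df = n_t + n_c - 4
    resid_var = (sse_t + sse_c) / df  # pooled residual variance
    se = math.sqrt(resid_var * (1 / n_t + 1 / n_c
                                + (cutoff - mx_t) ** 2 / sxx_t
                                + (cutoff - mx_c) ** 2 / sxx_c))
    diff = yhat_t - yhat_c
    return diff, diff / se, df

# Hypothetical data: pupils below a cutoff of 9 received the treatment.
diff, t_value, df = discontinuity_t(
    x_t=[2, 4, 6, 8], y_t=[15.8, 18.8, 18.8, 21.8],
    x_c=[10, 12, 14, 16], y_c=[20, 23, 23, 26],
    cutoff=9,
)
```

The t value is then referred to a one-tailed table with the returned
degrees of freedom.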

There is no generally accepted criterion for deciding whether the size
of the gain is large enough to be considered educationally significant.
Where standardized tests are used, the standard deviation of the
national norm group ($\sigma$) provides a useful reference for the
Regression Projection Model. As a rule of thumb, the authors suggest
one-third of a standard deviation as a reasonable cutoff value. In
other words, if

$$ \bar{Y}_t - \hat{Y}_t \ge \sigma / 3 $$

the gain may be considered educationally significant.


Model 4

General Regression Model

Summary

General Characteristics. This model may be thought of as a more
generalized form of the analysis of covariance model. Posttest
differences between any two (or more) groups can be tested, adjusting
for the effects of any number of quantifiable variables such as pretest
scores, sex, SES, location, etc., and their interactions. The effects
of using curved regression lines can also be tested and removed.
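
As a minimal illustration of the idea, and not the full model developed
in Horst (1974), the sketch below fits
posttest = a + b(pretest) + c(group) by ordinary least squares, where
group is coded 1 for treatment and 0 for comparison. The coefficient c
is then the posttest difference adjusted for pretest. The function
names and data are hypothetical, and no interactions, additional
covariates, or significance tests are included.

```python
def solve(a, b):
    """Solve a small linear system a * x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_general_regression(pretest, treated, posttest):
    """Return [intercept, pretest slope, adjusted group difference]."""
    rows = [[1.0, float(x), float(g)] for x, g in zip(pretest, treated)]
    k = 3
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, posttest)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data generated exactly from posttest = 5 + 0.8*pretest + 3*group.
pretest = [10, 12, 14, 16, 11, 13, 15, 17]
treated = [0, 0, 0, 0, 1, 1, 1, 1]
posttest = [5 + 0.8 * x + 3 * g for x, g in zip(pretest, treated)]
coefficients = fit_general_regression(pretest, treated, posttest)
```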

Strengths. The model itself places no restrictions on the selection of
students, their relative pretest performance, or on any facet of the
experimental design. Where other models implicitly assume that posttest
results are not related to variables other than the pretest, the
General Regression Model permits systematic tests of this assumption.

Weaknesses. All forms of the general regression model, including the
special case of analysis of covariance, test the hypothesis that
posttest differences are the effect of random fluctuations. Where
treatment and comparison groups were clearly different to begin with,
this is not a useful hypothesis to test (Lord, 1967, p. 38). Regression
models are frequently used to statistically "equate" groups which are
clearly different but, based on rather plausible assumptions about the
nature of the differences, regression models systematically underadjust
(Campbell & Erlebacher, 1970). It should be noted that this
underadjustment is minor where the correlation between the pretest and
the posttest is high and will provide a conservative estimate of
project impact where the treatment group has a lower mean pretest score
than the comparison group.

Implementation Considerations. While the flexibility of this model may
permit an adequate evaluation where none of the other models is
feasible, the complexity of both the multivariate statistical
manipulations and the experimental design issues creates major
obstacles to implementation. Only the most sophisticated specialists in
these areas should attempt to plan and implement a study of this
nature.

Implementation Procedures. An ad hoc design and detailed procedures
must be developed for each evaluation by a qualified specialist. A
complete, highly technical, mathematical development of the model is
available (Horst, 1974).


Model 5

Norm-referenced Model

Summary

General Characteristics. Project children are compared to a norm group
usually comprised of a nationally representative sample of children at
the same grade level. The no-treatment expectation is that the project
pupils will maintain, at posttesting, the same achievement status with
respect to the norm group as they had at pretesting. If their posttest
status is higher, the assumption is made that the improvement resulted
from participation in the special project.

Strengths. Where no comparison group is available, the norm group
provides a plausible estimate of no-treatment posttest scores. Even
where a comparison group is available, unless it comes from the same
population as the treatment group the Norm-referenced Model offers a
more defensible estimate of posttest performance at substantially less
cost and effort than a comparison-group design.

Weaknesses . The validity of the model rests on the assumption that 
the achievement status of a particular subgroup remains constant relative 
to the norm group over the pre- to posttest interval if no special treat- 
ment is provided. Empirical support for this assumption is minimal. It 
is conceivable that some subgroups would move up and others move down in 
the normal course of events. When the norm group is like the treatment
group, the plausibility of the underlying assumption is greatly enhanced; 
thus, for example, norms for gifted children would be best for assessing 
a project serving such pupils. 

Implementation Considerations. This model is widely applicable as it
does not require a comparison group. The model requires the use of
standardized tests. The same level of the same test should be used for
both pre- and posttesting (see Hazard 11). Program participants may not
be chosen on the basis of their pretest scores (see Hazard 7). Both
pre- and posttesting must be accomplished on dates corresponding to the
ones on which the test publisher collected normative data (see Hazard
3).

Implementation Procedures.

Step One: Select a standardized achievement test which has real
normative data points at dates which are suitable for pre- and
posttesting. Information about the normative data points for some of
the most commonly used instruments is presented in Appendix A. Similar
information about other tests can be derived from the Technical Manuals
provided by the test publishers.

It can be seen from Appendix A that most tests have only a single data
point, either in fall, winter, or spring. Use of these tests requires a
12-month pre- to posttest interval. If high student turnover is
expected, it might be better to choose a test for which normative data
have been collected in both fall and spring even though the choice of
tests is then quite limited.

Step Two: Administer and score the test in exact compliance 
with the procedures specified by the test publisher. Each test 
score should be converted to standard or scale scores. If the 
tests are scored by a scoring service, be sure to specify that 
each raw score should be converted to its standard or scale score
equivalent.

Step Three: Compute the means and standard deviations of the pre- and
posttest distributions of the standard or scale scores if these are not
provided by a test scoring service. Also compute the correlation
between pre- and posttest scores. Computational formulas for these
"summary statistics" can be found in any elementary statistics book. It
is necessary, of course, to do separate computations for each grade
level participating in the project.

Step Four: Look up the percentile equivalents of the mean pretest and
posttest standard or scale scores in the norm tables corresponding to
the pre- and posttest administration times. The pretest percentile
score is used to derive the no-treatment posttest expectation. In the
absence of a special treatment, it would be expected that a group of
pupils would maintain its standing relative to the norm group. Thus,
the expected posttest score can be found by looking up the standard or
scale score equivalent of the pretest percentile in the posttest norms
table. This score constitutes the no-treatment posttest expectation.
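
The two table lookups can be sketched as follows. The norm tables shown
are invented for illustration and do not come from any published test;
the function names are hypothetical. Linear interpolation between
tabled values is an assumption made here for convenience. A
publisher's tables should be read as the publisher directs.

```python
def interpolate(value, pairs):
    """Linearly interpolate the second column at `value` of the first."""
    pairs = sorted(pairs)
    for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
        if x0 <= value <= x1:
            return y0 + (y1 - y0) * (value - x0) / (x1 - x0)
    raise ValueError("value outside the range of the norms table")

def expected_posttest(mean_pretest, pre_norms, post_norms):
    """No-treatment expectation: the posttest scale score at the pretest
    percentile. Both tables hold (scale score, percentile) pairs."""
    percentile = interpolate(mean_pretest, pre_norms)
    inverted = [(pct, score) for score, pct in post_norms]
    return percentile, interpolate(percentile, inverted)

# Hypothetical (scale score, percentile) norms for fall and spring.
PRE_NORMS = [(380, 10), (400, 20), (420, 35), (440, 50)]
POST_NORMS = [(400, 10), (425, 20), (445, 35), (465, 50)]

percentile, expected = expected_posttest(410, PRE_NORMS, POST_NORMS)
```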

Step Five: Examine the obtained posttest score in relation to the
expected score. If the obtained or observed score is larger than the
expected score, there may be some reason to believe that the project
was effective. The statistical significance of the difference should be
checked using the following formula:

$$ t_{N-1} = \frac{\bar{Y}_{obs} - \bar{Y}_{exp}}{\sqrt{\dfrac{s_x^2 + s_y^2 - 2 r_{xy} s_x s_y}{N - 1}}} $$

Where  $\bar{Y}_{obs}$ = observed mean posttest score
       $\bar{Y}_{exp}$ = expected mean posttest score
       $s_x$ = pretest standard deviation
       $s_y$ = posttest standard deviation
       $r_{xy}$ = correlation between pre- and posttest scores
       $N$ = number of children
       $N - 1$ = degrees of freedom

The one-tailed probability of the computed t can be found in the tables
provided in most standard statistical texts. If p is less than or equal
to .05 ($p \le .05$), the special project may be said to have produced
statistically significant achievement gains.
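
The significance test in this step can be sketched as follows. The
function name and the numerical values are hypothetical. The t value is
computed as the difference between the observed and expected means
divided by the square root of
(s_x^2 + s_y^2 - 2 r_xy s_x s_y) / (N - 1), which is how the formula in
this step is read here.

```python
import math

def norm_referenced_t(mean_obs, mean_exp, sd_pre, sd_post, r, n):
    """Return the t value and its degrees of freedom (N - 1)."""
    var_diff = sd_pre ** 2 + sd_post ** 2 - 2 * r * sd_pre * sd_post
    t = (mean_obs - mean_exp) / math.sqrt(var_diff / (n - 1))
    return t, n - 1

# Hypothetical project: 26 children, observed mean 52 against expected 50.
t_value, df = norm_referenced_t(mean_obs=52, mean_exp=50,
                                sd_pre=10, sd_post=10, r=0.8, n=26)
```

The resulting t is referred to a one-tailed table with N - 1 degrees of
freedom.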

There is no generally accepted criterion for deciding whether the size
of the gain is large enough to be considered educationally significant.
Since standardized tests are used, the standard deviation of the
national norm group ($\sigma$) provides a useful reference. As a rule
of thumb, the authors suggest one-third of a standard deviation above
expectation based on posttest scores as a reasonable cutoff value. In
other words, if

$$ \bar{Y}_{obs} - \bar{Y}_{exp} \ge \sigma / 3 $$

the gain may be considered educationally significant.


V. GETTING THE DATA (TESTING AND RECORDING)



Once an evaluation design and an appropriate achievement test are
chosen, the most crucial step in the evaluation process is the
collection of accurate, complete data. Analysis of the data may be a
more technically complex step, but when analysis errors are discovered
they can usually be corrected. On the other hand, if data are distorted
or missing, no amount of analysis can adequately correct the problem.
If there are too many flaws in the raw data, the entire evaluation
becomes meaningless.

There are four steps in obtaining test data, each requiring planning 
and decisions: (a) assembling the students, (b) administering the tests, 
(c) scoring the tests, and (d) recording the scores. 

Step 1: Assembling Students for Testing 

This step, often passed over lightly, is an important consideration for
two reasons. First, of course, the time of day and the place where
students are assembled may affect test scores. The date of testing may
also be important (see the Norm-referenced Model, page 72). Second,
unless the problems are carefully thought out ahead of time, procedures
used for pretesting students may prove so cumbersome that changes are
made for the posttest. Changes such as testing students in their
classrooms rather than in a large assembly hall may or may not make a
big difference in scores, but it is certainly not safe to assume that
there is no difference. Having to abandon half of a carefully selected
control group because posttesting is too expensive is clearly
undesirable. Careful planning could avoid all such problems.

It is difficult to generalize about rules for assembling students
because of the wide differences among schools. Most important is to
minimize the disruption to the students while insuring that all
treatment and comparison students can take both pre- and posttests
under similar testing conditions. The major problems in achieving this
goal are high absentee rates and distribution of students across a
large number of schools. Where the evaluation simply involves testing
project students in their regular project setting, few problems should
be encountered. If, on the other hand, control students are involved,
or if students are to be tested before the project begins or after it
ends, then it is well worth the effort to lay out in detail the number
of different tests or test levels to be used, the number of test
locations, the time for each test, the number of make-up sessions, the
number of special test administrators or supervisors, and so on.
Testing often turns out to be a bigger project than anticipated, and,
if reduction of effort is necessary, it is better to simplify both the
pretest and posttest proportionally rather than expending too much
effort on the pretest and then being unable to complete the posttest.

Step 2: Administering the Tests

It goes without saying that test administration should be orderly, and
that cheating and other irregularities are not permissible. But
orderliness is not enough. For the purposes of evaluation it is
necessary to have consistency. There are two kinds of consistency to
worry about, depending on whether a norm-referenced or comparison-group
evaluation design is used. If a norm-referenced design is used, the
critical thing is to be sure that the test publisher's procedures are
followed exactly. This specifically includes reading instructions,
answering questions, doing practice problems, and timing each section.

When a comparison group is used, it is still advisable to follow the
publisher's instructions to the letter so as to make norm-referenced
comparison possible, but the most critical thing becomes the similarity
between treatment and comparison group testing situations. The most
straightforward way of insuring comparable situations is to test both
treatment and comparison students as a single group.¹ But, usually, in
either norm-referenced or comparison-group designs it will be necessary
to test several groups, and special steps must be taken to make sure
that they are tested under similar conditions so that their scores can
be compared.

1. However, bringing comparison group pupils into an unfamiliar project
lab for testing may put them at a disadvantage.

There are basically two ways of making test situations comparable. One
is to use a few carefully trained administrators to test all the
groups. The other is to carefully train the regular teachers to give
the tests to their students. At best, the latter alternative is much
less desirable from a research viewpoint, and some monitoring of the
testing procedure is advisable. If teachers must be used, it may be
advisable to have them test each others' classes to minimize possible
biases.

Telling teachers to look over the test manual is never adequate if one
is serious about the evaluation. Each test administrator should be
impressed by the importance of following procedures exactly, and each
one should have at least "walked through" the entire process, from
handing out pencils to collecting the tests, before ever administering
the test in an evaluation. Where teacher judgments are involved in
scoring student responses (as in oral reading tests), substantially more
training is required.

Step 3: Scoring the Tests

Scoring of standardized tests is usually separate from test
administration, so it becomes the third step in the data gathering
process. Obviously, the most important requirement in scoring is
accuracy, but there are trade-offs of time and money to consider. The
major variables are who does the scoring and what type of answer form
to use. Most of the major tests can be purchased with machine-scorable
booklets or separate answer sheets. Some non-standardized tests may be
available only in hand-scored versions.

A major factor in choosing among answer forms is the age of the
students. Separate answer sheets are usually much easier to process,
but young children tend to score lower on these forms, presumably
because the form is confusing to them. In general, separate answer
sheets are satisfactory for average fourth graders and all older
students. Younger children should use machine-scorable or hand-scored
booklets (Harcourt Brace Jovanovich).

Whichever type of form is used, there are three basic ways of having
the test scored. Scoring can be done by: (a) local school personnel,
(b) the publisher of the test, or (c) an independent test scoring
company. A choice between the test publisher or an independent company
will depend on a variety of variables specific to the local situation
and the test that is chosen. Cost, turnaround time, and quality of
service may vary, as well as the services offered, and some shopping is
in order. The major decision, however, is whether to have the scoring
done by either type of service or simply to have it done by available
school personnel. Obviously, there is no general answer that will apply
to all situations. The major advantages of a good scoring service are
the accuracy and the variety of analyses provided by computer
processing. The major disadvantages are the cost, the care necessary in
preparing the answer forms, and the turnaround time. There are also the
possibilities that forms will be lost in shipping, or that mishandling
or faulty equipment will result in scoring errors. There is little
recourse when forms are lost, but spot checks on scoring accuracy
should be made after answer forms have been
returned.

"Ballpark" figures for machine-scored forms (taken from one widely-used
publisher's service) range from $.30 to $.70 per pupil depending on the
type of form and length of the test battery. Hand-scored booklets cost
three or four times as much to score, although a lower original
purchase price will offset this difference slightly. Clearly, local
personnel can do the basic scoring at lower cost, but included in this
publisher's price are a number of features and services that are costly
and time consuming when scoring is done by hand. These include: (a)
conveniently formatted reports in triplicate for each group (e.g.,
class), completely identified as to test, date, group, etc.; (b) raw
scores, percentile scores (local or national distributions), and
standard scores for each student on each subtest; and (c) mean raw
scores for each group. Several other analyses are available for prices
ranging up to an additional $.17 per student for each analysis. These
include score distributions for each class, item analyses, and
individual student profiles. Additional statistical analyses are
readily available, or, for schools with access to their own computer
facilities, the scores are available from the publisher on computer
cards or tape.

In short, for very small tryouts with simple analyses it may be
desirable to do the entire job locally. Unless local computer
facilities are available, however, more extensive evaluations may well
be completed more accurately, thoroughly, and economically with the
help of a scoring service. All the major services have literature and
consultants to provide details and to assist in planning the scoring
and analysis.

Step 4: Recording the Scores 

The final step in data collection is the recording of scores in a
usable format. In practice, of course, this may not be a separate step,
and certainly planning to record the scores cannot be put off until the
other three steps are completed. For example, if a commercial scoring
service is used, scores may be returned in the standard format used by
the company. This is particularly true when computerized scoring and
analyses are done. However, schools that do their own recording or wish
to transfer scores from computer printouts to a more convenient form
need to think carefully about the way they wish to record their scores.
The exact format may seem like a small matter, but in many school
districts data from past evaluations are so badly arranged that any
analysis (especially where data are filed away for a year or two) is
almost impossible. Getting scores copied correctly onto record sheets
is not a complicated problem for small-scale local studies, but it must
not be overlooked. Even the most conscientious recorders make errors,
and all record sheets should be carefully proofread, preferably with
one person reading aloud while a second checks the scores.

It is not possible to prescribe a standard format and recording
procedure because school requirements and situations vary so much, but
some general principles can be established. Basically, scores must be
correct, completely identified, and arranged in such a way that they
can be easily analyzed. A data form illustrating these principles is
shown in Figure 4. Specific characteristics are discussed below,
starting at the top of the form.

Conventions for data recording forms

1. Most sets of scores require more than one page. The page number
   identifies each sheet and the "number of pages" helps make sure no
   pages are missing.

2. Every sheet of paper should have a name and date to indicate who
   filled in the numbers in case any questions arise in the future.

3. The group that is recorded should be clearly identified at the top
   of the page to simplify identifying their data when it becomes only
   one set in a large stack.

4. It simplifies analysis greatly to have only one test (pre and post)
   recorded on each sheet, provided the rules for listing students (see
   points 5-10 below) are followed. The complete name of the pretest
   and posttest (taken exactly from the test booklets and including
   publication date) must be listed. This point is widely neglected.

5. Identifying students and organizing their names efficiently is the
   most difficult problem in recording student data. Where evaluations
   are only for one year and are based on fall and spring testing, the
   problems can be solved with a little effort and care. But where
   students must be followed over several years, there is no simple
   solution since students come and go from projects, and groups are
   reorganized every year. The simplest rule is to make sure that the
   posttest scores are all entered on the same sheet of paper as the
   corresponding pretest scores. This at least eliminates the problem
   of the evaluator trying to find each student's name on two lists.

6. A second rule for listing student names is to establish a standard
   ordering of the names, and stick to it for the life of the
   evaluation and for all tests that are used. If a student moves or
   fails to take some of the tests, then the appropriate entries should
   be left blank, but he should not be eliminated from the list. If new
   students enter the program, their names should be added to the end
   of the lists for all tests, even those for which no data will be
   entered. In addition to the obvious reduction in confusion, there
   are some practical advantages to this procedure. For example, a
   master form can be prepared with only the students' names and
   identification numbers filled in, and the forms can simply be
   duplicated when new tests are given. It also makes comparisons or
   correlations between any two sets of scores relatively easy because
   any two forms can be laid side by side and the corresponding names
   will line up correctly. If there is a compelling reason to change
   the order of student names in the middle of a project, then either
   all forms should be changed or a double set of forms (old and new
   order) should be maintained.
7. A rule should be established for recording names. "Caldwell, D.E."
   should never become "Danny Caldwell" on a second list. The simplest
   procedure is to allow plenty of space and to spell out first names
   and middle initials (e.g., Caldwell, Daniel E.).

8. Each student should have an ID number that completely identifies
   him. The example in Figure 4 uses a one-digit experimental condition
   number, a two-digit group or class identification, a one-digit sex
   code, and a two-digit student number. In some evaluations, other
   codes (including letters) can be used, but careful consideration of
   the situation is necessary in order to permit any desired grouping
   simply by ID number.

9. The page should be arranged so that it can be photocopied without
   the students' names. This permits wide use of the data for research
   purposes without compromising student privacy.

10. A page should have some reasonable number of entries. For some
    inexplicable reason, numbers like 27 and 33 are popular, and often
    the number of entries varies from page to page. Unnecessary
    complications like this help to make the statistician's life
    miserable.

11. Test dates are critical, especially in norm-referenced evaluations.
    If all students listed on a page have their pretests in one day and
    all are later posttested in a single day, then the test date column
    is not really necessary. However, this is usually impossible to
    predict at the time the form is made up, and the columns should be
    there in order to permit identification of make-up tests and late
    entries into the program.

12. Pre- and posttest scores should, in general, be in adjacent
    columns, rather than pairing each pretest raw score with its
    standard score, percentile score, etc., followed by each posttest
    score and its transformations. This greatly simplifies the
    mechanics of analysis; comparisons are nearly always made between
    pre- and posttest scores of the same type.
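
The student ID number scheme described above (a one-digit experimental
condition number, a two-digit group or class identification, a
one-digit sex code, and a two-digit student number) can be sketched as
follows. Figure 4 is not reproduced here, so the order of the fields
and the particular code values are assumptions, and the function names
are hypothetical.

```python
def make_student_id(condition, group, sex, student):
    """Build a six-digit ID: condition (1), group (2), sex (1), student (2)."""
    if not (0 <= condition <= 9 and 0 <= group <= 99
            and 0 <= sex <= 9 and 0 <= student <= 99):
        raise ValueError("a field does not fit in its allotted digits")
    return f"{condition:d}{group:02d}{sex:d}{student:02d}"

def parse_student_id(student_id):
    """Split an ID back into its fields so records can be grouped by any one."""
    return {
        "condition": int(student_id[0]),
        "group": int(student_id[1:3]),
        "sex": int(student_id[3]),
        "student": int(student_id[4:6]),
    }

# Hypothetical codes: condition 1, class 07, sex code 2, student 15.
student_id = make_student_id(1, 7, 2, 15)
fields = parse_student_id(student_id)
```

Because each field occupies fixed positions, any desired grouping can
be made by inspecting the ID number alone.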

VI. ANALYZING THE DATA AND REPORTING THE RESULTS

Basic decisions relating to data analysis should be a part of the
original evaluation planning. The major decision is the selection of a
suitable evaluation model, treated in Chapters III and IV. A second
consideration which should be settled at the same time is the division
of the students into analysis subgroups. Because of the advantages of
having large numbers of students in an analysis, there is some
temptation to analyze all available treatment students as a single
group, and, where comparison students are used, to combine all of them
into a second group. This practice is not justified when distinct
subgroups of students are represented. In particular, it is almost
never advisable to combine data from (a) different treatment
conditions, (b) different grade levels, or (c) different tests. In most
education projects it is more meaningful to analyze each subgroup
separately, draw separate conclusions for each subgroup, and then
summarize the results of these individual analyses. Unless adequate
thought is given to the analysis subgroups in the initial planning
stages, the subgroups may be too small or too heterogeneous to permit
any convincing conclusions.

When the analysis subgroups are determined and the data are in hand,
the analysis can proceed. The essential steps for implementing each of
the five evaluation models are treated in Chapter IV, but the following
preliminary analysis and screening procedures should substantially
facilitate interpretation of the formal analysis findings.

For students with both pre- and posttest scores:
(1) Plot the distribution of the pretest raw scores, and compute the
    mean and standard deviation.
(2) Plot the distribution of the posttest raw scores, and compute the
    mean and standard deviation.
(3) Plot the joint pretest-posttest distribution, and compute the
    product-moment correlation.

For students with pretest scores only:
(1) Plot the distribution of the pretest raw scores, and compute the
    mean and standard deviation.

For students with posttest scores only:
(1) These scores are usually not interpretable by themselves, but may
    be saved for student files or used as baseline data for
    following-year evaluations.
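
The screening steps above can be sketched as follows: the records are
first divided into the three sets of students, and a mean, standard
deviation, and rough printed distribution are then produced for each
set of scores. The function names and data are hypothetical; missing
scores are assumed to be recorded as None, scores are assumed to be
whole numbers, and the standard deviation uses N as the divisor.

```python
import math

def partition(records):
    """records: list of (pretest, posttest); None marks a missing score."""
    both = [(a, b) for a, b in records if a is not None and b is not None]
    pre_only = [a for a, b in records if a is not None and b is None]
    post_only = [b for a, b in records if a is None and b is not None]
    return both, pre_only, post_only

def mean_sd(values):
    """Mean and standard deviation (divisor N) of a list of scores."""
    n = len(values)
    m = sum(values) / n
    return m, math.sqrt(sum((v - m) ** 2 for v in values) / n)

def text_histogram(values, width=5):
    """One line per score interval, with one asterisk per student."""
    counts = {}
    for v in values:
        low = (v // width) * width
        counts[low] = counts.get(low, 0) + 1
    return [f"{low:>3}-{low + width - 1:<3} {'*' * counts[low]}"
            for low in sorted(counts)]

records = [(10, 14), (12, None), (None, 20), (15, 18)]
both, pre_only, post_only = partition(records)
pre_mean, pre_sd = mean_sd([a for a, _ in both])
plot = text_histogram([10, 12, 15])
```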



In general, the size of any achievement gains will be apparent from 
the above analyses. The differences in mean scores which are tested sta- 
tistically in the various models can be inspected graphically by comparing 
the appropriate distributions. However, an equally important use of the 
plotted distributions is to permit inspection of the data for irregularities 
which may influence the interpretation of results. It is not possible to 
list all the kinds of irregularities that might be encountered, but the 
following occur frequently and are important: 

Floor or ceiling effects: Pretest and posttest distributions should
be inspected to see whether they are bunched near the top or the bottom of
the score range. The top of the score range is simply the highest possible
raw score. The bottom of the score range may be zero, but for multiple
choice tests it is usually taken to be the score that would be expected
if students were simply guessing. For example, in a typical four-choice
test students could be expected to get about one fourth of the items correct
by guessing. The impacts of floor and ceiling effects are discussed in
Hazard 4, page 15.
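A rough check for bunching near the floor or ceiling might look like the following sketch. The chance-level floor follows the rule given above (items divided by choices); the 10 percent margin used to define "near" is an arbitrary assumption of this example, not a recommendation from the guidebook.

```python
def chance_score(n_items, n_choices):
    """Expected raw score from blind guessing on a multiple-choice test."""
    return n_items / n_choices

def floor_ceiling_summary(scores, n_items, n_choices, margin=0.10):
    """Proportion of scores close to the chance floor and to the maximum.

    margin is the fraction of the usable range (maximum minus chance score)
    treated as "near"; it is an illustrative assumption.
    """
    floor = chance_score(n_items, n_choices)
    band = margin * (n_items - floor)
    n = len(scores)
    near_floor = sum(1 for s in scores if s <= floor + band) / n
    near_ceiling = sum(1 for s in scores if s >= n_items - band) / n
    return {"floor": floor, "near_floor": near_floor, "near_ceiling": near_ceiling}

# Invented raw scores on a hypothetical 40-item, four-choice test.
summary = floor_ceiling_summary([9, 11, 12, 20, 22, 25, 30, 38, 39, 40], 40, 4)
print(summary)
```

A large proportion in either band suggests that the test level is poorly matched to the students.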

Large changes in standard deviations from pretest to posttest: A
large increase in standard deviation indicates that the project is spreading
the students out by helping the initially better students relatively more
than the others. A decrease indicates that initially low scoring students
are helped relatively more. Either effect would be an important finding
and should be described in any evaluation reports on the project.

Low correlations between pre- and posttest scores or irregular
joint distributions: These symptoms can be the end result of a variety
of problems but, typically, they indicate that the tests are not measuring
the attribute of interest with sufficient reliability. If the skill is
not measured reliably then, clearly, improvements will not be adequately
measured, and positive project results may be obscured. With standardized
tests, correlations of .80 to .90 are possible. As correlations drop,
results become correspondingly less precise.

Differences between pretested students who took the posttest and
those who didn't: If students who have only pretest scores appear to be
much different on the pretest from those who took both pre- and posttests,
some investigation is required. There are many possible explanations.
The better students may graduate, or poorer students may drop out, or both.
Such findings are themselves important, and may also be relevant to the
interpretation of posttest distributions. If the better students are
missing from the posttest distribution, the mean score will be depressed.
If the poorer students are missing, the mean score will be spuriously
inflated.
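The comparison of students with and without posttest scores can be sketched as follows. The function name, the use of a pooled standard deviation to express the difference, and the wording of the notes are assumptions made for this illustration.

```python
import math

def attrition_check(pre_completers, pre_dropouts):
    """Compare pretest means of students with and without posttest scores.

    Returns the raw difference (dropouts minus completers), the same
    difference in pooled standard deviation units, and a short note.
    """
    n1, n2 = len(pre_completers), len(pre_dropouts)
    m1 = sum(pre_completers) / n1
    m2 = sum(pre_dropouts) / n2
    ss1 = sum((x - m1) ** 2 for x in pre_completers)
    ss2 = sum((x - m2) ** 2 for x in pre_dropouts)
    pooled_sd = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))
    diff = m2 - m1
    if diff > 0:
        note = "better students missing: posttest mean may be depressed"
    elif diff < 0:
        note = "poorer students missing: posttest mean may be inflated"
    else:
        note = "no difference in pretest means"
    return diff, diff / pooled_sd, note

# Invented pretest scores for illustration only.
diff, standardized, note = attrition_check([10, 12, 14, 16], [18, 20, 22])
print(diff, round(standardized, 2), note)
```

How large a difference warrants investigation remains a matter of judgment.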

Once the data have been carefully examined, the statistical tests
of the appropriate model may be applied. In most cases the results will
have been clear from inspection of the distributions and tests of
significance will serve mainly as a concise, easily reported confirmation that
differences were or were not likely to be due to chance factors. It must
be remembered that statistical significance depends, in practice, on the
number of students in the distributions. Even trivial differences in mean
scores become statistically significant when hundreds of students are
involved. Conversely, most project effects that could be represented as
educationally important will prove to be statistically significant, even
with as few as 25 or 30 students.
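The dependence of significance on group size can be illustrated with a simple t statistic. The sketch assumes two independent groups of equal size with a common standard deviation, which is a simplification and not one of the five models in this guidebook; a t of roughly 2 is the usual large-sample benchmark for the .05 level.

```python
import math

def t_statistic(mean_diff, sd, n_per_group):
    """t for the difference between two independent group means.

    Assumes equal group sizes and a common standard deviation.
    """
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    return mean_diff / standard_error

# The same one-point difference (one tenth of a standard deviation)
# evaluated at two group sizes.
t_small = t_statistic(1.0, 10.0, 50)
t_large = t_statistic(1.0, 10.0, 800)
print(t_small, t_large)
```

With 50 students per group the difference is far from the benchmark; with 800 per group the same difference reaches it, although its educational importance has not changed.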

The question of how big a gain must be before it is considered
educationally important is, of course, a judgmental question rather than a
statistical one. The evaluator or the project director may well be called
upon to offer an opinion on this issue, and while no specific guidelines
could cover the variety of settings and situations for educational projects,
the above comments suggest three issues that must be clearly separated in
drawing conclusions about the educational importance of project effects.
One issue is the size of the project effects. A second is the cost
associated with implementing the project, and the third is the
conclusiveness of the evaluation results.

The importance of a given project effect usually depends on the cost
of the project and the available alternatives. That is, a project that
costs very little in money or effort may be very worthwhile even if its
effects are rather small, provided there are no obvious, superior
alternatives. Any large effect is obviously important in principle, but in
practice it may be very costly, and cheaper alternative projects may have
comparable effects. While neat cost-effectiveness conclusions are still
beyond the state-of-the-art in educational evaluation, decisions should be
based on the best information that can be provided.

In addition to the size of the project effect, the conclusiveness
of the evaluation should also be discussed. The total evaluation should
be weighed in terms of all of the issues discussed in this guidebook, and
factors that appear to affect the results should be noted. The hazards
discussed in Chapter II, the model weaknesses from Chapters III and IV,
and the data collection issues of Chapter V must all be considered.
Further, it is the position of the authors that conclusive generalizations
about a project are possible only after amassing consistent evidence from
a variety of evaluations over a period of time. No single tryout can
provide a sound basis for generalizations no matter how carefully it is
conducted.

The evaluation of projects within an operational, educational system
is an extremely difficult task and decision makers need to become aware of
the practical limitations of the process. Often, complications beyond the
control of the evaluator preclude any definitive conclusions about project
effectiveness, and it is the responsibility of the evaluator to reflect
this situation accurately in the evaluation report.

Finally, it should be noted that all of the models in this guidebook
are directed at the question of how much better students did with the project
than they would have done without it. Decision makers, however, may be
interested in some other criterion, such as bringing the mean scores of
treatment students up to the national norm. This particular criterion is
widely encountered, and while it may represent a meaningful goal, a word
of caution is in order. While every evaluator will recognize that exactly
half of all students will always be below the national average, it is never
safe to assume that the decision maker understands this statistical truism.
A brief discussion of the issue, including the reasonableness of the
criterion for the particular treatment students in the project, should always
accompany any reference to such a criterion.

Reporting

The evaluation report is the final link in the evaluation process.
Unless the results are adequately presented, the entire evaluation is of
little use to anyone. A variety of people will be interested in the results
and, ideally, a separate report should be prepared for each type of audience.
In practice, however, only one report will be written and it should cover
the requirements of a wide range of readers. The recommendations below
assume at least two basic audiences: (a) the local school board and
administrators, and (b) educators, government officers, and school personnel
outside the local district. The first group will include non-specialists
interested in an easily understood description of the project and its
results. The second group will include skeptical evaluation specialists
who must be convinced that the findings are valid. To meet the needs of
the first group, a clear summary of the project's results should be
provided. The summary should not be more than two or three pages long
and should be included at the front of the report. The body of the report
should be concise, but complete, in order to meet the needs of the critical
evaluation specialist, and should cover the issues of objectives, costs,
and affective changes as well as achievement gains. Report organization
and appropriate topics other than achievement gains are discussed in detail
in Hawkridge, Campeau, & Trickett (1970). Examples of appropriate section
headings and formats can be found in any educational research journal.

In presenting achievement gains, a convincing report must explain
exactly what was done in the evaluation, provide statistics summarizing
the results, and justify the conclusions of the evaluators. In preparing
the description of what was done, it should be kept in mind that the
critical reader will be concerned about all of the hazards in Chapter II
of this guidebook and is likely to analyze the evaluation report
systematically for possible weaknesses (as in Tallmadge & Horst, 1974). Where
information is missing, he will probably assume the worst. Ideally, all of
the questions raised in Chapter II and in Tallmadge and Horst (1974), as
well as those in Chapter IV specific to a particular model, should be
anticipated and discussed.

At a minimum, the report should include a brief description and
justification of the model used, a summary of the data, and the results
of significance tests. A widespread error is the omission of summary
statistics that are required if the results are to be meaningful. In
particular, evaluation reports often present only mean scores as evidence
of effectiveness. While means alone may be sufficient in a report summary,
every mean score reported in the body of a report should be accompanied
by the number of students represented (N), and the standard deviation of
the distribution (s). In addition, it must always be clear whether or not
any two means represent exactly the same group of students. Claims of
statistical significance should clearly elaborate (or reference) the exact
test used, as well as the numerical results of the test. Discussions of
educational importance should clearly indicate the local standards against
which the project is compared. The local setting also bears on the extent
to which the project might be replicable in other school districts, and
should be spelled out as clearly as possible.
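As a small illustration of the reporting rule that every mean be accompanied by N and s, the following sketch formats one summary line per score distribution. The function name and the layout of the line are hypothetical, and the standard deviation is assumed to use n - 1 in the denominator.

```python
import math

def summary_row(label, scores):
    """Format N, mean, and standard deviation for one score distribution."""
    n = len(scores)
    m = sum(scores) / n
    s = math.sqrt(sum((x - m) ** 2 for x in scores) / (n - 1))
    return f"{label}: N = {n}, mean = {m:.1f}, s = {s:.1f}"

# Invented scores for illustration only.
row = summary_row("Pretest", [10, 12, 14, 16])
print(row)
```

Reporting pretest and posttest rows for the same N makes it clear that the two means describe exactly the same students.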

The evaluator's final decision concerns the saving of information
from the evaluation. The completed report will provide summarized results,
but many of the analyses and statistics recommended in this chapter will
not be included. It is not customary, for example, to include graphs of
score distributions in a report unless they illustrate some special point.
Most evaluators will, however, want to keep these graphs plus all calculated
statistics on file for future reference. Whether the raw data recording
sheets are saved or not depends on local policy and on the possible use of
the data in evaluations during subsequent years. Provided the preliminary
analyses of this chapter and the specific analyses of Chapter IV have been
carefully completed and documented, it is unlikely that the raw data will
be needed for future reanalyses.

1. California Achievement Test (1970 Edition)

A. Levels/Grades/Forms

Level 1 / Grades 1.5-2 / Form A
Level 2 / Grades 2-4 / Form A
Level 3 / Grades 4-6 / Form A
Level 4 / Grades 6-9 / Form A
Level 5 / Grades 9-12 / Form A

B. Normative Data Point

February-March (beginning- and end-of-year norms are projections
and should not be used in norm-referenced evaluations.)

C. Types of Scores

Raw Scores (appropriate for use with Anchor Test Study
Equivalency Tables)
Grade-equivalent Scores
Achievement Development Scale Scores (expanded standard
scores) (should be used for all statistical computations
not involving Anchor Test Study conversions)
Percentiles and Stanines (beginning- and end-of-year scores
are projections and should not be used in norm-referenced
evaluations)

D. Comments

The reading scales of Levels 3 (Grades 4 and 5) and 4
(Grade 6) were included in the Anchor Test Study. The
CAT may thus be used for norm-referenced evaluations
under the following conditions:

1. Pretest and posttest in late February (12-month
interval) using CAT norms

2. Pretest and posttest in mid-April (12-month
interval) using Anchor Test Study Individual Score
Norms*. Reading only, and grades 4, 5, and 6 only

3. Pretest in mid-October, posttest in mid-April
using Anchor Test Study Equivalency Tables* and
Metropolitan Achievement Test norms. Reading
only, and grades 4, 5, and 6 only.



* The following procedure is recommended for use with Anchor Test Study
data. First, convert each pupil's CAT raw score to the equivalent MAT
raw score. Second, convert each MAT raw score to its corresponding
standard score. Third, calculate all statistics using MAT standard scores.
Then, if Anchor Test Study norms are to be used, convert the mean MAT
standard score to its MAT raw score equivalent. The corresponding
percentile can then be read out of the Individual Score Norms Tables (not
the School Means Norms Tables). If the MAT norms are to be used, percentile
equivalents are provided corresponding to mean standard scores.
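The conversion chain in this footnote can be sketched as a series of table lookups. Every table value below is an invented placeholder, not Anchor Test Study or MAT data, and the rule of taking the nearest tabled standard score when converting the mean back to a raw score is an assumption of this sketch.

```python
# Placeholder tables: all values are invented for illustration only.
CAT_RAW_TO_MAT_RAW = {20: 18, 21: 19, 22: 21, 23: 22}
MAT_RAW_TO_STANDARD = {18: 60, 19: 62, 21: 66, 22: 68}
MAT_RAW_TO_PERCENTILE = {18: 30, 19: 34, 21: 42, 22: 46}  # Individual Score Norms

def mean_percentile(cat_raw_scores):
    """Follow the footnote: CAT raw -> MAT raw -> MAT standard -> mean -> percentile."""
    # First, convert each CAT raw score to its MAT raw score equivalent.
    mat_raw = [CAT_RAW_TO_MAT_RAW[s] for s in cat_raw_scores]
    # Second, convert each MAT raw score to its standard score.
    standard = [MAT_RAW_TO_STANDARD[r] for r in mat_raw]
    # Third, compute statistics (here only the mean) on standard scores.
    mean_standard = sum(standard) / len(standard)
    # Convert the mean standard score back to the nearest tabled raw score.
    raw_equivalent = min(
        MAT_RAW_TO_STANDARD,
        key=lambda r: abs(MAT_RAW_TO_STANDARD[r] - mean_standard),
    )
    # Read the percentile for that raw score from the norms table.
    return mean_standard, raw_equivalent, MAT_RAW_TO_PERCENTILE[raw_equivalent]

result = mean_percentile([21, 22, 23, 23])
print(result)
```

The point of the sequence is that averaging is done on standard scores, never on raw scores or percentiles.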



2. Cooperative Primary Tests (1965 Edition) 

A. Levels/Grades/Forms

12 / Grades 1.5-2.0 / Forms A & B 
23 / Grades 2.0-3.9 / Forms A & B 

B. Normative Data Points 

Late October-early November and late April-early May 

C. Types of Scores 

Raw Scores 

Scale Scores (expanded standard scores) (should be used for
all statistical computations)
Percentiles

D. Comments

This test has appropriate norms for a fall pretest-spring
posttest norm-referenced evaluation. It was not included
in the Anchor Test Study because it does not cover grades
4, 5, and 6.

3. Comprehensive Test of Basic Skills (1968 Edition)
A. Levels/Grades/Forms

Level 1 / Grades 2.5-4 / Forms Q & R
Level 2 / Grades 4-6 / Forms Q & R
Level 3 / Grades 6-8 / Forms Q & R
Level 4 / Grades 8-10 / Forms Q & R

B. Normative Data Point

Last week of February-first week of March (Beginning-
and End-of-year norms are projections and should not
be used in norm-referenced evaluations.)

C. Types of Scores

Raw Scores (appropriate for use with Anchor Test Study
Equivalency Tables)
Grade-equivalent Scores
Expanded Standard Scores (should be used for all
statistical computations not involving Anchor Test Study
conversions)
Percentiles and Stanines (Beginning- and End-of-year
scores are projections and should not be used in
norm-referenced evaluations.)

D. Comments

The reading scales of Level 2, Form Q (Grades 4 and 5)
and Level 3, Form Q (Grade 6) were included in the Anchor
Test Study. The CTBS may thus be used for norm-referenced
evaluations under the following conditions:

1. Pretest and posttest at end of February-beginning
of March (12-month interval) using Anchor Test
Study Individual Score Norms* in reading only,
and grades 4, 5, and 6 only.

3. Pretest in mid-October, posttest in mid-April
using Anchor Test Study Equivalency Tables*
and Metropolitan Achievement Test norms. Reading
only, and grades 4, 5, and 6 only.



* Procedures recommended for using Anchor Test Study Equivalency Tables
and norms with the California Achievement Test are presented in the
footnote on page 94. The same procedures should be used with Form Q of the
CTBS. If Form R of the CTBS is used, each raw score must be converted
to its Form Q equivalent (using conversion tables provided by the publisher)
before the Anchor Test Study tables are used.

4. Gates-MacGinitie Reading Tests (1964 Edition)

A. Levels/Grades/Forms

Primary A  / 1.5-2.0  / 1, 1M, 2, 2M
Primary B  / 2.0-3.0  / 1, 1M, 2, 2M
Primary C  / 3.0-4.0  / 1, 1M, 2, 2M
Primary CS / 2.5-4.0  / 1, 1M, 2, 2M, 3, 3M
Survey D   / 4.0-7.0  / 1M, 2M, 3M
Survey E   / 7.0-10.0 / 1M, 2M, 3M



B. Normative Data Points 

October and April except January for first grade. (February
and May norms are projections. Because of the
proximity of the May norms to the April data point, the
May norms are probably adequate for use with
norm-referenced comparisons. The February norm, however, cannot
be recommended for use with such comparisons.)

C. Types of Scores 

Raw Scores (appropriate for use with Anchor Test Study
Equivalency Tables)
Grade Scores
Standard Scores (should be used for all statistical
computations not involving Anchor Test Study conversions)

D. Comments 

The standard scores provided for the Gates-MacGinitie are
not expanded standard scores. It is thus not possible to
relate scores from one level of the test to norms for
another level, and using test levels with appropriate
norms may produce ceiling or floor effects when
disadvantaged or gifted students are tested. (See Hazard 4, p. 15.)

Survey D, Form 1M was included in the Anchor Test Study.
The Gates-MacGinitie may thus be used for norm-referenced
evaluations under the following conditions:

1. Pretest in mid-October, posttest in mid-May using
Gates-MacGinitie norms (but with the possibility
that ceiling and floor effects may be encountered)

2. Pretest and posttest in mid-April (12-month interval)
using Anchor Test Study Individual Score Norms*.
Grades 4, 5, and 6 only.

3. Pretest in mid-October and posttest in mid-April
using Anchor Test Study Equivalency Tables* and
Metropolitan Achievement Test norms. Grades 4,
5, and 6 only.



* Procedures recommended for using Anchor Test Study Equivalency Tables
and norms with the California Achievement Test are presented in the
footnote on page 94. The same procedures should be used with Form 1M of the
Gates-MacGinitie. The implication of using other forms is not clear as
score equivalency tables are not provided by the publishers despite the
probable existence of between-form differences. The test publishers
apparently presume that the differences are so small as to be negligible.






5. Iowa Test of Basic Skills (1971 Edition)

A. Levels/Grades/Forms

Level 7  / 1.7-2.5 / Forms 5 & 6
Level 8  / 2.6-3.5 / Forms 5 & 6
Level 9  / 3.0-3.9 / Forms 5 & 6
Level 10 / 4.0-4.9 / Forms 5 & 6
Level 11 / 5.0-5.9 / Forms 5 & 6
Level 12 / 6.0-6.9 / Forms 5 & 6
Level 13 / 7.0-7.9 / Forms 5 & 6
Level 14 / 8.0-8.9 / Forms 5 & 6

B. Normative Data Point 

Last half of October, first half of November (Mid-year and
Spring norms are projections and should not be used for
norm-referenced evaluations)

C. Types of Scores

Raw Scores (appropriate for use with Anchor Test Study
Equivalency Tables)
Grade-equivalent Scores
Age-equivalent Scores
Standard Scores (expanded standard scores) (should be used
for all statistical computations not involving Anchor Test
Study conversions)
Percentiles and Stanines (Mid-year and Spring scores are
projections and should not be used for norm-referenced
evaluations)

D. Comments

The reading scales of Levels 10 (Grade 4), 11 (Grade 5),
and 12 (Grade 6), Form 5 were included in the Anchor Test
Study. The ITBS may thus be used for norm-referenced
evaluation under the following conditions:




1. Pretest and posttest in late October-early November
(12-month interval) using ITBS norms

2. Pretest and posttest in mid-April (12-month interval)
using Anchor Test Study Individual Score Norms*.
Reading only, and grades 4, 5, & 6 only.

3. Pretest in mid-October and posttest in mid-April
using Anchor Test Study Equivalency Tables* and
Metropolitan Achievement Test norms. Reading only,
and grades 4, 5, & 6 only.






* Procedures recommended for using Anchor Test Study Equivalency Tables
and norms with the California Achievement Test are presented on page 94.
The same procedures should be used with Form 5 of the ITBS. The
implications of using other forms are not clear as score equivalency tables are
not provided despite the fact that some between-form differences are
present. The test publishers apparently presume that the differences are
so small as to be negligible.

6. Metropolitan Achievement Test

A. Levels/Grades/Forms

Primary 1    / 1.5-2.4 / F, G, H
Primary 2    / 2.5-3.4 / F, G, H
Elementary   / 3.5-4.9 / F, G, H
Intermediate / 5.0-6.9 / F, G, H
Advanced     / 7.0-9.5 / F, G, H

B. Normative Data Points

Mid-October and mid-April

C. Types of Scores

Raw Scores
Grade-equivalent Scores
Standard Scores (expanded standard scores) (should be used
for all statistical computations)
Percentiles and Stanines

D. Comments

The reading scales of Form F of the Elementary (Grade 4)
and Intermediate (Grades 5 and 6) Levels were included
in the Anchor Test Study. The MAT may thus be used for
norm-referenced evaluation under the following conditions:

1. Pretest in mid-October and posttest in mid-April
using MAT norms.

2. Pretest and posttest in mid-April (12-month interval)
using Anchor Test Study Individual Score Norms*.
Reading only, and grades 4, 5, and 6 only.



* If Anchor Test Study norms are to be used, convert the mean MAT standard
score to its raw score equivalent. The corresponding percentile can then
be read out of the Individual Score Norms Table (not the School Means Norms
Tables). If the MAT norms are to be used, percentile equivalents are
provided corresponding to mean standard scores.

7. Sequential Tests of Educational Progress II (1969 Edition)

A. Levels/Grades/Forms

4 / 4-6 / A, B
3 / 7-9 / A, B
2 / 10-12 / A, B

B. Normative Data Point

Last week in April, first three weeks in May (Fall norms
are identical to the spring norms for the previous grade.
As such, they should not be used in norm-referenced
evaluations.)

C. Types of Scores 

Raw Scores (appropriate for use with Anchor Test Study
Equivalency Tables)
Converted Scores (expanded standard scores) (should be used
for all statistical computations not involving Anchor
Test Study conversions)
Percentiles and Stanines (Fall scores are projections and
should not be used in norm-referenced evaluations)

D. Comments 

The reading scales of Level 4, Form A, were included in
the Anchor Test Study. STEP II may thus be used for
norm-referenced evaluations under the following conditions:

1. Pretest and posttest in early May (12-month
interval) using STEP II norms.

2. Pretest and posttest in mid-April (12-month
interval) using Anchor Test Study Individual Score
Norms*. Reading only, and grades 4, 5, & 6 only.

3. Pretest in mid-October, posttest in mid-April using
Anchor Test Study Equivalency Tables* and Metropolitan
Achievement Test norms. Reading only, and grades
4, 5, & 6 only.



* Procedures recommended for using Anchor Test Study Equivalency Tables
and norms with the California Achievement Test are presented in the footnote
on page 94. The same procedures should be used with Form A of STEP II.
If Form B is used, each raw score must be converted to its Form A
equivalent (using conversion tables provided by the publisher) before the
Anchor Test Study Tables are used.

8. SRA Achievement Series (1971 Edition)



A. Levels/Grades/Forms

Primary I  / 1.0-5.5  / E, F
Primary II / 1.0-5.9  / E, F
Blue       / 3.5-8.5  / E, F
Green      / 4.5-9.9  / E, F
Red        / 6.5-10.5 / E, F

B. Normative Data Point

Mid-April (Beginning- and middle-of-year norms are projections
and should not be used in norm-referenced evaluations)

C. Types of Scores

Raw Scores (appropriate for use with Anchor Test Study
Equivalency Tables)
Grade-equivalent Scores
Growth Score Values
Percentiles and Stanines (Beginning- and Middle-of-year
scores are projections and should not be used in
norm-referenced evaluations)

D. Comments

Form E of the Blue level (Grades 4 and 5) and the Green
level (Grade 6) were included in the Anchor Test Study.
The SRA Achievement Tests may thus be used for
norm-referenced evaluations under the following conditions:

1. Pretest and posttest in mid-April (12-month interval)
using SRA Achievement norms.

2. Pretest and posttest in mid-April (12-month interval)
using Anchor Test Study Individual Score Norms*.
Reading only, grades 4, 5, & 6 only.

3. Pretest in mid-October and posttest in mid-April
using Anchor Test Study Equivalency Tables* and
Metropolitan Achievement Test norms. Reading
only, grades 4, 5, & 6 only.



* Procedures recommended for using Anchor Test Study Equivalency Tables
and norms with the California Achievement Test are presented in the footnote
on page 94. The same procedures should be used with Form E of the SRA
Achievement Tests. The implication of using Form F is not clear as score
equivalency tables are not provided by the publishers despite the probable
existence of between-form differences. The test publishers apparently
presume that the differences are so small as to be negligible.

9. Stanford Achievement Tests (1973 Edition)



A. Levels/Grades/Forms 

Primary I       / 1.5-2.8 / A, B, C
Primary II      / 2.5-3.8 / A, B, C
Primary III     / 3.8-4.8 / A, B, C
Intermediate I  / 4.8-5.8 / A, B, C
Intermediate II / 5.8-7.8 / A, B, C
Advanced        / 7.1-9.8 / A, B, C

B. Normative Data Points 

October, February, and May (Most of the SAT percentile and
stanine norms tables are closely tied to empirical data.
The following, however, are projections and should not
be used for norm-referenced evaluations: Primary II,
grade 3.5; Primary III, grades 3.5 and 4.5; Intermediate
I, grades 4.5 and 5.5; Intermediate II, grades 6.5 and
7.5; Advanced, grades 7.5, 8.5, and 9.5.)

C. Types of Scores 

Raw Scores 

Grade-equivalent Scores 

Scaled Scores (expanded standard scores) (should be used 
for all statistical computations) 

Percentiles and Stanines (percentiles and stanines obtained 
from the projected norms tables listed above should not 
be used for norm-referenced evaluations.) 



D. Comments 

An earlier edition of the Stanford Achievement Tests (1964)
was included in the Anchor Test Study. The new edition,
however, has many advantages over the old and should be
preferred despite the fact that it cannot be used in
conjunction with the Anchor Test Study Equivalency Tables.

APPENDIX B



Analysis of Covariance Worksheets 

Analysis of covariance is both theoretically and computationally 
complex. An evaluator undertaking this analysis should have access to 
a good reference book describing the approach in detail. Tatsuoka (1971, 
Ch. 3) and McNemar (1969, Ch. 18) provide readable explanations of the 
model. A more complete development is available in Winer (1971, Ch. 10).
Because of the amount of computation involved, the use of a computer 
is highly desirable. Appropriate programs can be provided by most com- 
puter centers. 

Where the amount of data is small or computer facilities are
unavailable, the calculations can be done by hand. This appendix provides
a set of worksheets for simplifying the computational work. The
worksheets are referenced directly to the numerical example in Winer (1971,
p. 775) and preserve his notation, but are revised for the case of two
groups (treatment plus comparison). Since the textbook example is for
three groups, it is not directly applicable to the typical project
evaluation.

Four worksheets are provided: 

Worksheet One is used to record intermediate results that are used for
the remaining calculations. All of the terms in columns one and two
will be available from the preliminary analyses recommended in Chapter
VI.

Worksheet Two is used to arrive at the basic test of significance of the 
project effects. 

Worksheet Three is used to test whether the regression lines for the
two groups have the same slope. If the F ratio for the regression lines
is significant (i.e., the two slopes are not equal) then analysis of
covariance should not be used, and the F ratio from Worksheet Two is
meaningless. Logically, Worksheet Three should be completed before
Worksheet Two. Only items (202), (205), and (208) from Worksheet Two
are needed to complete Worksheet Three.

Worksheet Four is used to calculate the adjusted mean posttest scores.
These adjusted scores are used only to provide an estimate of the "real"
effect of the project. They may be useful in determining "educational"
significance, but are not involved in the computation of statistical
significance.
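For readers working by computer, the purposes of Worksheets Three and Four can be sketched as follows. The formulas are the standard two-group versions found in textbook treatments of analysis of covariance (a comparison of pooled and separate within-group regression residuals, and adjustment of each group mean by the pooled within-group slope); they are assumed here to match the worksheets, whose row-by-row layout is not reproduced. All function names are hypothetical.

```python
def _sums(pairs):
    """n, mean X, mean Y, and within-group sums of squares and products."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return n, mx, my, sxx, sxy, syy

def slope_homogeneity_f(treatment, comparison):
    """F ratio for equality of the two within-group regression slopes."""
    stats = [_sums(treatment), _sums(comparison)]
    e_xx = sum(s[3] for s in stats)
    e_xy = sum(s[4] for s in stats)
    e_yy = sum(s[5] for s in stats)
    pooled_resid = e_yy - e_xy ** 2 / e_xx            # one common slope
    separate_resid = sum(s[5] - s[4] ** 2 / s[3] for s in stats)  # two slopes
    df_den = sum(s[0] for s in stats) - 4             # N - 2k with k = 2 groups
    f = (pooled_resid - separate_resid) / (separate_resid / df_den)
    return f, 1, df_den

def adjusted_means(treatment, comparison):
    """Posttest means adjusted to the grand pretest mean (pooled slope)."""
    stats = [_sums(treatment), _sums(comparison)]
    b = sum(s[4] for s in stats) / sum(s[3] for s in stats)
    n_total = sum(s[0] for s in stats)
    grand_x = sum(s[0] * s[1] for s in stats) / n_total
    return tuple(s[2] - b * (s[1] - grand_x) for s in stats)

# Invented (pretest, posttest) pairs for illustration only.
t = [(2, 4), (4, 6), (6, 11)]
c = [(1, 2), (3, 3), (5, 7)]
f_slopes, df1, df2 = slope_homogeneity_f(t, c)
adj_t, adj_c = adjusted_means(t, c)
print(f_slopes, df1, df2, adj_t, adj_c)
```

Hand calculation on the worksheets remains the check on any such program, and vice versa.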

Significance Levels

Tables of F values are available in McNemar (1969, pp. 509-511) and Winer
(1971, pp. 864-868). In McNemar:

n_1 = degrees of freedom (df) for the numerator
n_2 = degrees of freedom (df) for the denominator

The .05 level of significance is suggested in this guidebook. Winer
uses the notation (1 - \alpha) = .95 for the .05 level of significance.

Notation Used on the Worksheets:

i = student number

j = group ID (i.e., j = Treatment (t) or Comparison (c))

X_{ij} = pretest raw score for student i of group j

Y_{ij} = posttest raw score for student i of group j

n_j = number of students in group j

N = total number of students (N = n_t + n_c)

The remaining notation in this appendix follows Winer (1971). It may be 
helpful to note that in Winer: 

S refers to "total" variation 

E refers to "error" or "within group" variation 

T refers to "treatment" or "between groups" variation

and that on page 775 (omitting subscripts):

(1x) = (\sum\sum X)^2 / N
(2x) = \sum\sum X^2
(3x) = \sum_j [ (\sum_i X)^2 / n_j ]

(1xy) = (\sum\sum X)(\sum\sum Y) / N
(2xy) = \sum\sum XY
(3xy) = \sum_j [ (\sum_i X)(\sum_i Y) / n_j ]

(1y) = (\sum\sum Y)^2 / N
(2y) = \sum\sum Y^2
(3y) = \sum_j [ (\sum_i Y)^2 / n_j ]

The double summation signs (\sum\sum) indicate that the values are first summed
over all n_j students in each group, then the two group sums are added
together.
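The nine computational symbols can be obtained directly from the paired scores, as in the following sketch. The function name and the dictionary layout are hypothetical; the formulas are the standard definitions of these symbols for groups of unequal size.

```python
def computational_symbols(groups):
    """Compute (1x) through (3xy) from paired scores.

    groups maps a group label to a list of (pretest X, posttest Y) pairs.
    """
    pairs = [p for g in groups.values() for p in g]
    n_total = len(pairs)
    sum_x = sum(x for x, _ in pairs)
    sum_y = sum(y for _, y in pairs)
    return {
        "1x": sum_x ** 2 / n_total,
        "2x": sum(x * x for x, _ in pairs),
        "3x": sum(sum(x for x, _ in g) ** 2 / len(g) for g in groups.values()),
        "1y": sum_y ** 2 / n_total,
        "2y": sum(y * y for _, y in pairs),
        "3y": sum(sum(y for _, y in g) ** 2 / len(g) for g in groups.values()),
        "1xy": sum_x * sum_y / n_total,
        "2xy": sum(x * y for x, y in pairs),
        "3xy": sum(
            sum(x for x, _ in g) * sum(y for _, y in g) / len(g)
            for g in groups.values()
        ),
    }

# Invented (pretest, posttest) pairs for illustration only.
symbols = computational_symbols({
    "t": [(2, 4), (4, 6), (6, 11)],
    "c": [(1, 2), (3, 3), (5, 7)],
})
print(symbols)
```

Differences between these symbols give the total (S) and within-group (E) sums of squares and products used on Worksheet Two, for example (2x) - (1x) for total variation in X.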

On all worksheets, results which are needed for later calculations are
identified by a three-digit number. The number (148), for example,
indicates Worksheet One, Column 4, Row 8. Worksheets Two and Three are
not divided into columns, so, for example, (212) indicates Worksheet Two,
Row 12.

There are no mathematical checks built into the worksheets. To ensure
accuracy it is essential to have two persons complete the calculations
independently.

ANALYSIS OF COVARIANCE
Worksheet Two (Winer, 1971, pp. 775-778)

Computation of F ratio for the significance of the
adjusted difference between the Treatment and Comparison groups

S_{xx} = (142) - (141) = ________ (201)
E_{xx} = (142) - (143) = ________ (202)
T_{xx} = (201) - (202) = ________ (203)

S_{xy} = (148) - (147) = ________ (204)
E_{xy} = (148) - (149) = ________ (205)
T_{xy} = (204) - (205) = ________ (206)

S_{yy} = (145) - (144) = ________ (207)
E_{yy} = (145) - (146) = ________ (208)
T_{yy} = (207) - (208) = ________ (209)

S'_{yy}  = (207) - (204)^2 / (201) = ________ (210)
E'_{yy}  = (208) - (205)^2 / (202) = ________ (211)
T'_{yyR} = (210) - (211)           = ________ (212)

F = (212) [(130) - 3] / (211) = ________ (213)

degrees of freedom:
1 numerator
\sum (n_j - 1) - 1 = N - 3 denominator
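The arithmetic of Worksheet Two can be checked with a short routine such as the one below. It is a sketch only: the function name and argument names are hypothetical, and item (130) is assumed to be the total number of students N, consistent with the N - 3 denominator degrees of freedom shown above.

```python
def ancova_f(s_xx, s_xy, s_yy, e_xx, e_xy, e_yy, n_total):
    """F ratio for the adjusted treatment-comparison difference (two groups).

    Arguments correspond to items (201), (204), (207), (202), (205), (208)
    and to N, assumed to be item (130).
    """
    s_yy_adj = s_yy - s_xy ** 2 / s_xx      # (210)
    e_yy_adj = e_yy - e_xy ** 2 / e_xx      # (211)
    t_yy_adj = s_yy_adj - e_yy_adj          # (212)
    f = t_yy_adj * (n_total - 3) / e_yy_adj  # (213)
    return {"210": s_yy_adj, "211": e_yy_adj, "212": t_yy_adj,
            "F": f, "df": (1, n_total - 3)}

# Invented values for six students in two groups of three.
result = ancova_f(s_xx=17.5, s_xy=28.5, s_yy=53.5,
                  e_xx=16.0, e_xy=24.0, e_yy=40.0, n_total=6)
print(result)
```

The resulting F is compared with the tabled value for 1 and N - 3 degrees of freedom at the chosen significance level.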
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